From 5a53892cb8846d7b915d4bcf83f53f35b4a78d35 Mon Sep 17 00:00:00 2001
From: Nicolas Hug
Date: Tue, 26 Nov 2019 18:14:32 -0500
Subject: [PATCH 1/9] WIP

---
 slep011/proposal.rst | 379 +++++++++++++++++++++++++++++++++++++++++++
 under_review.rst     |   2 +-
 2 files changed, 380 insertions(+), 1 deletion(-)
 create mode 100644 slep011/proposal.rst

diff --git a/slep011/proposal.rst b/slep011/proposal.rst
new file mode 100644
index 0000000..16dc10e
--- /dev/null
+++ b/slep011/proposal.rst
@@ -0,0 +1,379 @@
+.. _slep_010:
+
+=====================
+SLEP011: random state
+=====================
+
+:Author: Nicolas Hug
+:Status: Under review
+:Type: Standards Track
+:Created: 2019-11-23
+
+Current status
+==============
+
+`random_state` parameters are commonly used in estimators and in CV
+splitters. They can be either int, RandomState instances, or None. The
+parameter is stored as an attribute in init and never changes, as per our
+convention.
+
+`fit` and `split` usually look like this::
+
+    def fit(self, X, y):  # or split(self, X, y)
+        rng = check_random_state(self.random_state)
+        ...
+        rng.some_method()  # directly use instance, e.g. for sampling
+        some_function_or_class(random_state=rng)  # pass it down to other tools
+                                                  # e.g. if self is a meta-estimator
+
+`check_random_state` behaves as follows::
+
+    def check_random_state(x):
+        if x is an int, return np.random.RandomState(seed=x)
+        if x is a RandomState instance, return x
+        if x is None, return numpy's RandomState singleton instance
+
+Accepting instances or None comes with different complications, listed below.
+`None` is the current default for all estimators / splitters.
+
+
+Problems with passing RandomState instances or None
+===================================================
+
+The main issue with RandomState instances and None is that they make the
+estimator stateful across calls to `fit`. Almost all the other issues are
+consequences of this fact.
+
+Statefulness and violation of fit idempotence
+---------------------------------------------
+
+Estimators should be stateless. That is, any call to `fit` should forget
+whatever happened in previous calls to `fit`, provided that no warm-start is
+happening. Whether CV splitters should be stateful or not is debatable, and
+that point is discussed below.
+
+Another related convention is that the `fit` method should be idempotent:
+calling `est.fit(X, y)` and then `est.fit(X, y)` again should yield the same
+model both times.
+
+These properties are key for enforcing reproducibility. We have common
+checks for both of them.
+
+If a `RandomState` instance or None is used, the idempotency property may be
+violated::
+
+    rng = RandomState(0)
+    est = Est(..., random_state=rng)
+    est.fit(...)  # consumes rng
+    est.fit(...)
+
+The model inferred the second time isn't the same as the first one, since
+the rng has been consumed during the first call. The statefulness property
+isn't respected either, since the first call to `fit` has an impact on the
+second.
+
+The same goes with passing `None`.
+
+A related issue is that the `rng` may be consumed outside of the estimator.
+The estimator isn't "self-contained" anymore and its behaviour now depends
+on things that happen outside of it.
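+
+Continuing with the toy notation above (`Est`, `X` and `y` are
+placeholders), here is a small sketch of this lack of self-containment::
+
+    rng = RandomState(0)
+    est = Est(..., random_state=rng)
+    rng.randint(100)  # some unrelated code consumes the shared rng...
+    est.fit(X, y)     # ... so this model now differs from the one we would
+                      # have obtained without the line above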
+
+Countless bugs
+--------------
+
+Quoting @amueller from `14042
+<https://github.com/scikit-learn/scikit-learn/issues/14042>`_, many bugs
+have happened over the years because of RandomState instances and None.
+
+These bugs are often hard to find, and some of them are actual data leaks,
+see e.g. `14034
+<https://github.com/scikit-learn/scikit-learn/issues/14034>`_. Some of these
+bugs have been around forever and we just haven't discovered them yet.
+
+The bug in `14034
+<https://github.com/scikit-learn/scikit-learn/issues/14034>`_ is that the
+validation subset for early-stopping was computed using `self.random_state`:
+in a warm-starting context, that subset may be different across multiple
+calls to `fit`, since the RNG is getting consumed. We instead want the
+subset to be the same for all calls to `fit`. The fix was to store a private
+seed that was generated the first time `fit` is called.
+
+This is a typical example of many other similar bugs that we need to
+monkey-patch with potentially overly complex logic.
+
+Cloning may return a different estimator
+----------------------------------------
+
+`my_clone = clone(est)` returns an *unfitted* estimator whose parameters are
+(supposed to be) the same as those that were passed when `est` was
+instantiated. Whether
+*clone* is a good name for describing this process is another issue, but the
+semantics of cloning in scikit-learn are as described above. We can think of
+*cloning* as *reset with initial parameters*.
+
+That semantic is not respected if `est` was instantiated with an instance or
+with None::
+
+    rng = RandomState(0)
+    est = Est(..., random_state=rng)
+    est.fit(X, y)  # consumes RNG here
+    my_clone = clone(est)
+    my_clone.fit(X, y)  # not the same model!
+
+`my_clone` isn't the same as what `est` was, since the RNG has been consumed
+in-between. Fitting `my_clone` on the same data will not give the same model
+as `est`. While this is not desirable when an instance is passed, one might
+argue that this is the desired behaviour when None is passed.
+
+In addition, `est` and `my_clone` are now interdependent because they share the
+same `rng` instance.
+
+Incompatibility with third-party estimators
+--------------------------------------------
+
+From the snippet above in the introduction, most meta-estimators will
+directly pass down `self.random_state` to their sub-estimators. Some
+third-party libraries only support integers as `random_state`, not instances
+or None. See e.g. `15611
+<https://github.com/scikit-learn/scikit-learn/issues/15611>`_.
+
+CV-Splitters statefulness
+-------------------------
+
+CV-splitters are stateful::
+
+    rng = np.random.RandomState(0)
+    cv = KFold(shuffle=True, random_state=rng)
+    a = cv.split(X, y)
+    b = cv.split(X, y)  # different from a
+
+`a` and `b` are different splits, because of how `split` is implemented (see
+introduction above).
+
+This behaviour is inconsistent for two reasons.
+
+The first one is that if `rng` were an int, then `a` and `b` would have been
+equal. Indeed, the behaviour of the CV splitter depends on the
+**type** of the `random_state` parameter:
+
+- int -> stateless, get the same splits each time you call split()
+- None or instance -> stateful, get different splits each time you call split()
+
+Concretely, we have a method (`split`) whose behaviour depends on the *type*
+of a parameter that was passed to `__init__`. We can argue that this is a
+common pattern in object-oriented design, but in the case of the
+`random_state` parameter, this is potentially confusing.
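+
+A runnable sketch of this type-dependent behaviour, using toy data and
+`KFold` as an example splitter::
+
+    import numpy as np
+    from sklearn.model_selection import KFold
+
+    X = np.zeros((10, 2))
+
+    cv = KFold(n_splits=5, shuffle=True, random_state=0)  # int
+    a = [test for _, test in cv.split(X)]
+    b = [test for _, test in cv.split(X)]
+    assert all(np.array_equal(s, t) for s, t in zip(a, b))  # same splits
+
+    cv = KFold(n_splits=5, shuffle=True,
+               random_state=np.random.RandomState(0))  # instance
+    a = [test for _, test in cv.split(X)]
+    b = [test for _, test in cv.split(X)]
+    # different splits (with overwhelming probability):
+    assert not all(np.array_equal(s, t) for s, t in zip(a, b))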
+
+The second inconsistency is that splitters are stateful by design, while we
+want our estimators to be stateless. Granted, splitters aren't estimators.
+But, quoting `@GaelVaroquaux,
+`_,
+consistency is one thing that we are really good at.
+
+So it is important to have the splitters consistent with the estimators,
+w.r.t. the statelessness property. The current behaviour is not necessarily
+clear for users. Fixing how random_state is handled in the splitters is one
+of the entries in the `Roadmap
+<https://scikit-learn.org/stable/roadmap.html>`_.
+
+Some users may rely on the current behaviour, `to implement e.g.
+bootstrapping,
+`_.
+If we make the splitters stateless, the "old" behaviour can be easily
+reproduced by simply creating new CV instances, instead of calling `split`
+on the same instance. Instances are dead cheap to create.
+
+Potential bugs in custom parameter searches
+-------------------------------------------
+
+(This issue is a direct consequence of the splitters being stateful.)
+
+We have a private API for subclassing BaseSearchCV and implementing custom
+parameter search strategies. The contract is that the custom class should
+override the `_run_search(evaluate_candidates, ...)` method, which itself must
+call the `evaluate_candidates()` closure, where `cv.split()` will be called.
+
+Third-party developers may only *call* `evaluate_candidates()`, not change
+its content. Now, since `cv.split()` is called in `evaluate_candidates()`,
+that means that `evaluate_candidates()` will potentially evaluate its
+candidate parameters **on different splits** each time it is called.
+
+This is quite a subtle issue that third-party developers might easily
+overlook.
+
+Depending on the intended behaviour of the parameter search, this may or may
+not be a good thing. This is typically a bug if we implement successive
+halving + warm start (details omitted here, you may refer to `this issue
+`_ for some more
+details).
+
+Proposed Solution
+=================
+
+We need a solution that fixes the statefulness of the estimators and the
+splitters.
+
+The proposed solution is to store the *state* of a RandomState instance in
+`__init__`::
+
+    def __init__(self, ..., random_state=None):
+        self.random_state = check_random_state(random_state).get_state()
+
+That `random_state` attribute is a tuple with about 620 integers, as returned
+by `get_state()
+`_.
+
+That state is then used in `fit` or in `split` as follows::
+
+    def fit(self, X, y):  # or split()
+        rng = np.random.RandomState()
+        rng.set_state(self.random_state)
+        # ... use rng as before
+
+Since `self.random_state` is a tuple that never changes, calling `fit` or
+`split` on the same instance always gives the same results.
+
+We want `__init__` and `set_params/get_params` to be consistent. To that end,
+we will need to special-case these methods::
+
+    def get_params(self):
+        random_state = np.random.RandomState()
+        random_state.set_state(self.random_state)
+        return {'random_state': random_state, ...}
+
+    def set_params(self, random_state, ...):
+        self.random_state = check_random_state(random_state).get_state()  # same as in init
+
+`clone` does not need to be special-cased, because `get_params` does all the
+work. Note that the following::
+
+    est.set_params(random_state=est.get_params()['random_state'])
+
+behaves as expected and does not change the `random_state` attribute of the
+estimator. However, one should not use::
+
+    est.set_params(random_state=est.random_state)
+
+since `est.random_state` is neither an int, None, nor an instance: it is a
+tuple. We can error with a decent message in that case.
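+
+Putting it all together, here is a minimal runnable sketch of the proposal
+(`ToyEstimator` is a made-up estimator whose "model" is just a random
+vector; the `None` subtlety discussed in the drawbacks below is ignored
+here)::
+
+    import numpy as np
+    from sklearn.utils import check_random_state
+
+    class ToyEstimator:
+        def __init__(self, random_state=None):
+            self.random_state = check_random_state(random_state).get_state()
+
+        def fit(self, X, y=None):
+            rng = np.random.RandomState()
+            rng.set_state(self.random_state)
+            self.coef_ = rng.uniform(size=X.shape[1])  # dummy random "model"
+            return self
+
+    X = np.zeros((10, 3))
+    est = ToyEstimator(random_state=0)
+    coef_first = est.fit(X).coef_
+    coef_second = est.fit(X).coef_  # fit is now idempotent
+    assert np.allclose(coef_first, coef_second)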
+
+Advantages:
+
+- It fixes the statefulness issue. `fit` is now idempotent. Calling `split` on
+  the same instance gives the same splits. In other words, it does what we
+  want.
+
+- It is relatively simple to implement, and not too intrusive.
+
+- Backward compatibility is preserved between scikit-learn versions. Let A
+  be a version with the current behaviour (say 0.22) and let B be a version
+  where the new behaviour is implemented. The models and the splits obtained
+  will be the same in A and in B. That property may not be respected with
+  other solutions, see below.
+
+- Both RandomState instances and None are still supported. We don't need to
+  deprecate the use of either of them.
+
+- As a bonus, the `self.random_state` attribute is an *actual* random state:
+  it is the state of some RNG. What we currently call `random_state` is not
+  a state but a RNG (though this is numpy's fault).
+
+Drawbacks:
+
+- we break our convention that `__init__` should only ever store attributes, as
+  they are passed in. Note however that this convention is enforced so that
+  the semantics of `__init__` and `set_params` are the same, and to enable
+  people to change public attributes without having surprising behaviour.
+  **This is still respected here.** So this isn't really an issue.
+
+- there is a subtlety that occurs when passing `None`. `check_random_state`
+  will return the singleton `np.random.mtrand._rand`, and we will call
+  `get_state()` on the singleton. The thing is, its state won't change
+  across multiple instances unless the singleton is consumed. So if we do
+  `a = Est(random_state=None); b = Est(random_state=None)`, a and b actually
+  have exactly the same `random_state` attribute, since the state of the
+  singleton wasn't changed. One solution would be to consume the singleton's
+  RNG in `__init__`, with a private helper.
+
+- The `__repr__()` will need to special-case the `random_state` attribute to
+  avoid printing a long tuple.
+
+- We need to store about 620 integers. This is however negligible w.r.t. e.g.
+  the size of a typical dataset.
+
+- It does not fix the issue about third-party packages only accepting integers.
+  This can however be fixed in meta-estimator, independently.
+
+Alternative solutions
+=====================
+
+Store a seed instead of a state
+--------------------------------
+
+Instead of storing a state from `get_state()`, we could store a randomly
+generated seed::
+
+    def __init__(self, ..., random_state=None):
+        self.random_state = check_random_state(random_state).randint(0. BIG_INT)
+
+Then instead of using `set_state` we could just use
+`rng = RandomState(seed=self.random_state)` in `fit` or `split`.
+
+Advantages:
+
+- It also fixes the third-party estimators issue, since we would be passing
+  self.random_state, which is an int
+- It's cheaper than storing 620 ints
+- we only need to special-case `set_params()`. `get_params()` can be left as-is,
+  same for `clone`.
+
+
+Drawbacks:
+
+- We want that
+  `est.set_params(random_state=est.get_params()['random_state'])` does not
+  change `est.random_state`. With this logic, this is not possible to enforce.
+  Also, clone will not work as expected.
+
+- It is not backward compatible between versions. For example if you passed
+  an int in version A (say 0.22), then in version B (with the new
+  behaviour), your estimator will not start with the same RNG when `fit` is
+  called the first time. Same for splitters. For this reason, storing a
+  state may be preferable.
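+
+A sketch of this backward compatibility drawback (`BIG_INT` as in the
+snippet above; the version names are hypothetical)::
+
+    import numpy as np
+
+    BIG_INT = 2**31 - 1
+
+    # Version A (current behaviour): random_state=0 means that fit() uses
+    # RandomState(0) directly.
+    rng_a = np.random.RandomState(0)
+
+    # Version B (seed-storing behaviour): __init__ draws a new seed from
+    # RandomState(0), and fit() uses a RandomState seeded with it.
+    seed = np.random.RandomState(0).randint(0, BIG_INT)
+    rng_b = np.random.RandomState(seed)
+
+    # rng_a and rng_b produce different streams, so a user passing
+    # random_state=0 would get different models in A and in B.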
+
+Store the state in fit instead of in init
+-----------------------------------------
+
+Instead of storing the output of `get_state()` in `__init__`, we could store it
+the first time `fit()` is called. For example::
+
+    def fit(self):
+        self._random_state = getattr(self, '_random_state',
+                                     check_random_state(self.random_state).get_state())
+        rng = np.random.RandomState()
+        rng.set_state(self._random_state)
+        # ...
+
+The advantage is that we respect our convention with `__init__`.
+
+However, `fit` idempotency isn't respected anymore: the first call to `fit`
+clearly influences all the other ones. This also introduces a private
+attribute, so we would need more intrusive changes to `set_params`,
+`get_params`, and `clone`.
+
+
+References and Footnotes
+------------------------
+
+.. [1] Each SLEP must either be explicitly labeled as placed in the public
+   domain (see this SLEP as an example) or licensed under the `Open
+   Publication License`_.
+
+.. _Open Publication License: https://www.opencontent.org/openpub/
+
+
+Copyright
+---------
+
+This document has been placed in the public domain. [1]_
diff --git a/under_review.rst b/under_review.rst
index 47ee5f8..6c83081 100644
--- a/under_review.rst
+++ b/under_review.rst
@@ -4,4 +4,4 @@ SLEPs under review
 .. toctree::
    :maxdepth: 1
 
-   slep010/proposal
+   slep011/proposal

From 926ad397e08e4eaeee94f263b21af68bb0635c69 Mon Sep 17 00:00:00 2001
From: Nicolas Hug
Date: Tue, 26 Nov 2019 18:54:19 -0500
Subject: [PATCH 2/9] still WIP

---
 slep011/proposal.rst | 72 ++++++++++++++++++++++++--------------------
 1 file changed, 39 insertions(+), 33 deletions(-)

diff --git a/slep011/proposal.rst b/slep011/proposal.rst
index 16dc10e..d91e09e 100644
--- a/slep011/proposal.rst
+++ b/slep011/proposal.rst
@@ -1,4 +1,4 @@
-.. _slep_010:
+.. _slep_011:
 
 =====================
 SLEP011: random state
 =====================
@@ -45,7 +45,7 @@
 estimator stateful across calls to `fit`. Almost all the other issues are
 consequences of this fact.
 
 Statefulness and violation of fit idempotence
----------------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Estimators should be stateless. That is, any call to `fit` should forget
 whatever happened in previous calls to `fit`, provided that no warm-start is
 happening. Whether CV splitters should be stateful or not is debatable, and
 that point is discussed below.
@@ -79,7 +79,7 @@
 The estimator isn't "self-contained" anymore and its behaviour now depends
 on things that happen outside of it.
 
 Countless bugs
---------------
+~~~~~~~~~~~~~~
 
 Quoting @amueller from `14042
 <https://github.com/scikit-learn/scikit-learn/issues/14042>`_, many bugs
 have happened over the years because of RandomState instances and None.
@@ -102,7 +102,7 @@
 This is a typical example of many other similar bugs that we need to
 monkey-patch with potentially overly complex logic.
 
 Cloning may return a different estimator
-----------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 `my_clone = clone(est)` returns an *unfitted* estimator whose parameters are
 (supposed to be) the same as those that were passed when `est` was
 instantiated. Whether
@@ -129,7 +129,7 @@
 In addition, `est` and `my_clone` are now interdependent because they share the
 same `rng` instance.
 
 Incompatibility with third-party estimators
--------------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 From the snippet above in the introduction, most meta-estimators will
 directly pass down `self.random_state` to their sub-estimators. Some
@@ -138,7 +138,7 @@
 third-party libraries only support integers as `random_state`, not instances
 or None. See e.g. `15611
 <https://github.com/scikit-learn/scikit-learn/issues/15611>`_.
 
 CV-Splitters statefulness
--------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~
 
 CV-splitters are stateful::
 
     rng = np.random.RandomState(0)
@@ -166,25 +166,23 @@
 parameter, this is potentially confusing.
 
 The second inconsistency is that splitters are stateful by design, while we
 want our estimators to be stateless. Granted, splitters aren't estimators.
-But, quoting `@GaelVaroquaux,
+But, quoting `@GaelVaroquaux
 `_,
 consistency is one thing that we are really good at.
-
 So it is important to have the splitters consistent with the estimators,
 w.r.t. the statelessness property. The current behaviour is not necessarily
-clear for users. Fixing how random_state is handled in the splitters is one
-of the entries in the `Roadmap
-<https://scikit-learn.org/stable/roadmap.html>`_.
+clear for users.
 
-Some users may rely on the current behaviour, `to implement e.g.
-bootstrapping,
+Note Fixing how random_state is handled in the splitters is one of the
+entries in the `Roadmap <https://scikit-learn.org/stable/roadmap.html>`_. Some
+users may rely on the current behaviour, `to implement e.g. bootstrapping,
 `_.
 If we make the splitters stateless, the "old" behaviour can be easily
 reproduced by simply creating new CV instances, instead of calling `split`
-on the same instance. Instances are dead cheap to create.
+on the same instance. Instances are cheap to create.
 
 Potential bugs in custom parameter searches
--------------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 (This issue is a direct consequence of the splitters being stateful.)
 
 We have a private API for subclassing BaseSearchCV and implementing custom
@@ -213,6 +211,9 @@ Proposed Solution
 =================
 
 We need a solution that fixes the statefulness of the estimators and the
 splitters.
 
+A toy example of the proposed solution is implemented in this `notebook
+`_.
+
 The proposed solution is to store the *state* of a RandomState instance in
 `__init__`::
 
@@ -282,20 +283,22 @@ Advantages:
 
 Drawbacks:
 
-- we break our convention that `__init__` should only ever store attributes, as
-  they are passed in. Note however that this convention is enforced so that
-  the semantics of `__init__` and `set_params` are the same, and to enable
-  people to change public attributes without having surprising behaviour.
-  **This is still respected here.** So this isn't really an issue.
+- We break our convention that `__init__` should only ever store attributes, as
+  they are passed in. Note however that the reason we have this convention
+  is that we want the semantics of `__init__` and `set_params` to be the same,
+  and we want to enable people to change public attributes without having
+  surprising behaviour. **This is still respected here.** So this isn't
+  really an issue.
 
-- there is a subtlety that occurs when passing `None`. `check_random_state`
+- There is a subtlety that occurs when passing `None`. `check_random_state`
   will return the singleton `np.random.mtrand._rand`, and we will call
   `get_state()` on the singleton. The thing is, its state won't change
   across multiple instances unless the singleton is consumed. So if we do
   `a = Est(random_state=None); b = Est(random_state=None)`, a and b actually
   have exactly the same `random_state` attribute, since the state of the
-  singleton wasn't changed. One solution would be to consume the singleton's
-  RNG in `__init__`, with a private helper.
+  singleton wasn't changed. To circumvent this, the logic in `__init__` and
+  `set_params` involves a private helper that makes sure the singleton's RNG is
+  consumed. Please refer to the notebook.
 
 - The `__repr__()` will need to special-case the `random_state` attribute to
   avoid printing a long tuple.
@@ -310,13 +313,13 @@ Alternative solutions
 =====================
 
 Store a seed instead of a state
--------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Instead of storing a state from `get_state()`, we could store a randomly
 generated seed::
 
     def __init__(self, ..., random_state=None):
-        self.random_state = check_random_state(random_state).randint(0. BIG_INT)
+        self.random_state = check_random_state(random_state).randint(0, BIG_INT)
 
 Then instead of using `set_state` we could just use
 `rng = RandomState(seed=self.random_state)` in `fit` or `split`.
@@ -326,9 +329,8 @@ Advantages:
 
 - It also fixes the third-party estimators issue, since we would be passing
   self.random_state, which is an int
 - It's cheaper than storing 620 ints
-- we only need to special-case `set_params()`. `get_params()` can be left as-is,
-  same for `clone`.
-
+- We don't need to artificially consume the singleton's RNG since it is
+  de-facto consumed anyway.
 
 Drawbacks:
@@ -340,16 +342,15 @@ Drawbacks:
 
 - We want that
   `est.set_params(random_state=est.get_params()['random_state'])` does not
   change `est.random_state`. With this logic, this is not possible to enforce.
   Also, clone will not work as expected.
 
 - It is not backward compatible between versions. For example if you passed
   an int in version A (say 0.22), then in version B (with the new
   behaviour), your estimator will not start with the same RNG when `fit` is
-  called the first time. Same for splitters. For this reason, storing a
-  state may be preferable.
+  called the first time. Same for splitters.
 
-Store the state in fit instead of in init
------------------------------------------
+Store the state in fit/split instead of in init
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Instead of storing the output of `get_state()` in `__init__`, we could store it
 the first time `fit()` is called. For example::
 
-    def fit(self):
+    def fit(self):  # or split()
         self._random_state = getattr(self, '_random_state',
                                      check_random_state(self.random_state).get_state())
         rng = np.random.RandomState()
         rng.set_state(self._random_state)
@@ -363,6 +364,11 @@
 attribute, so we would need more intrusive changes to `set_params`,
 `get_params`, and `clone`.
 
+Execution
+=========
+
+That change might be a 1.0 thing.
+
 References and Footnotes
 ------------------------

From e5ba859a22e496ef6829554512f770828631727d Mon Sep 17 00:00:00 2001
From: Nicolas Hug
Date: Wed, 27 Nov 2019 08:28:28 -0500
Subject: [PATCH 3/9] more

---
 slep011/proposal.rst | 129 +++++++++++++++++++++++--------------------
 1 file changed, 70 insertions(+), 59 deletions(-)

diff --git a/slep011/proposal.rst b/slep011/proposal.rst
index d91e09e..6a311ec 100644
--- a/slep011/proposal.rst
+++ b/slep011/proposal.rst
@@ -1,23 +1,23 @@
 .. _slep_011:
 
-=====================
-SLEP011: random state
-=====================
+====================================
+SLEP011: Fixing randomness behaviour
+====================================
 
 :Author: Nicolas Hug
 :Status: Under review
 :Type: Standards Track
-:Created: 2019-11-23
+:Created: 2019-11-27
 
-Current status
-==============
+How we currently handle randomness
+==================================
 
 `random_state` parameters are commonly used in estimators and in CV
 splitters. They can be either int, RandomState instances, or None. The
 parameter is stored as an attribute in init and never changes, as per our
 convention.
 
-`fit` and `split` usually look like this::
+`fit` and `split` methods typically look like this::
 
     def fit(self, X, y):  # or split(self, X, y)
         rng = check_random_state(self.random_state)
@@ -33,16 +33,17 @@ convention.
         if x is a RandomState instance, return x
         if x is None, return numpy's RandomState singleton instance
 
-Accepting instances or None comes with different complications, listed below.
-`None` is the current default for all estimators / splitters.
-
+Accepting RandomState instances or None comes with different complications,
+listed below. `None` is the current default for all estimators / splitters.
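+
+For reference, the pseudo-code above corresponds roughly to the following
+actual helper (a sketch: the real implementation in `sklearn.utils` also
+accepts the `np.random` module itself, and raises a `ValueError` on any
+other input)::
+
+    import numbers
+
+    import numpy as np
+
+    def check_random_state(seed):
+        if seed is None:
+            return np.random.mtrand._rand  # numpy's global singleton
+        if isinstance(seed, numbers.Integral):
+            return np.random.RandomState(seed)
+        if isinstance(seed, np.random.RandomState):
+            return seed
+        raise ValueError('%r cannot be used to seed a RandomState' % seed)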
 
 Problems with passing RandomState instances or None
 ===================================================
 
 The main issue with RandomState instances and None is that they make the
-estimator stateful across calls to `fit`. Almost all the other issues are
-consequences of this fact.
+estimators and the splitters stateful across calls to `fit` or `split`. As
+a result, different calls to `fit` or `split` on the same instance yield
+different results. Almost all the other issues are consequences of this
+fact.
 
 Statefulness and violation of fit idempotence
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -72,7 +73,8 @@
 the rng has been consumed during the first call. The statefulness property
 isn't respected either, since the first call to `fit` has an impact on the
 second.
 
-The same goes with passing `None`.
+The same goes with passing `None`: you get different models every time you
+call `fit`.
 
 A related issue is that the `rng` may be consumed outside of the estimator.
 The estimator isn't "self-contained" anymore and its behaviour now depends
 on things that happen outside of it.
@@ -93,10 +95,10 @@
 bugs have been around forever and we just haven't discovered them yet.
 
 The bug in `14034
 <https://github.com/scikit-learn/scikit-learn/issues/14034>`_ is that the
 validation subset for early-stopping was computed using `self.random_state`:
-in a warm-starting context, that subset may be different across multiple
-calls to `fit`, since the RNG is getting consumed. We instead want the
-subset to be the same for all calls to `fit`. The fix was to store a private
-seed that was generated the first time `fit` is called.
+that subset is different across multiple calls to `fit`, since the RNG is
+getting consumed. This is a problem when doing warm-start, because we want
+the subset to be the same for all calls to `fit` in this case. The fix was
+to store a private seed that was generated the first time `fit` is called.
 
 This is a typical example of many other similar bugs that we need to
 monkey-patch with potentially overly complex logic.
@@ -153,7 +155,7 @@
 `a` and `b` are different splits, because of how `split` is implemented (see
 introduction above).
 
 This behaviour is inconsistent for two reasons.
 
 The first one is that if `rng` were an int, then `a` and `b` would have been
-equal. Indeed, the behaviour of the CV splitter depends on the
+equal. As a result, the behaviour of the CV splitter depends on the
 **type** of the `random_state` parameter:
 
 - int -> stateless, get the same splits each time you call split()
 - None or instance -> stateful, get different splits each time you call split()
@@ -173,18 +175,14 @@
 But, quoting `@GaelVaroquaux
 `_,
 consistency is one thing that we are really good at.
 So it is important to have the splitters consistent with the estimators,
 w.r.t. the statelessness property. The current behaviour is not necessarily
 clear for users.
 
-Note Fixing how random_state is handled in the splitters is one of the
-entries in the `Roadmap <https://scikit-learn.org/stable/roadmap.html>`_. Some
-users may rely on the current behaviour, `to implement e.g. bootstrapping,
-`_.
-If we make the splitters stateless, the "old" behaviour can be easily
-reproduced by simply creating new CV instances, instead of calling `split`
-on the same instance. Instances are cheap to create.
+Note that fixing how random_state is handled in the splitters is one of the
+entries in the `Roadmap <https://scikit-learn.org/stable/roadmap.html>`_.
 
 Potential bugs in custom parameter searches
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-(This issue is a direct consequence of the splitters being stateful.)
+This issue is a direct consequence of the splitters being stateful. It's also
+more advanced than the rest, so you may want to skip it.
 
 We have a private API for subclassing BaseSearchCV and implementing custom
 parameter search strategies. The contract is that the custom class should
The contract is that the custom class should @@ -193,8 +191,8 @@ call the `evaluate_candidates()` closure, were `cv.split()` will be called. Third-party developers may only *call* `evaluate_candidates()`, not change its content. Now, since `cv.split()` is called in `evaluate_candadates()`, -that means that `evalute_candidates()` will potentially evaluate its -candidates parameters **on different splits** each time it is called. +that means that `evalute_candidates()` will evaluate the candidate +parameters **on different splits** each time it is called. This is a quite subtle issue that third-party developers might easily overlook. @@ -209,21 +207,21 @@ Proposed Solution ================= We need a solution that fixes the statefulness of the estimators and the -splitters. +splitters. Most of the remaining issues would be fixed as a consequence. A toy example of the proposed solution is implemented in this `notebook -`_. +`_. +The bulk of the solution is to manipulate actual random *states*, as +returned by `get_state() +`_. -The proposed solution is to store the *state* of a RandomState instance in +Specifically, we would store the *state* of a RandomState instance in `__init__`:: def __init__(self, ..., random_state=None): self.random_state = check_random_state(random_state).get_state() -That `random_state` attribute is a tuple with about 620 integers, as returned -by `get_state() -`_. - +That `random_state` attribute is a tuple with about 620 integers. That state is then used in `fit` or in `split` as follows:: def fit(self, X, y): # or split() @@ -231,8 +229,8 @@ That state is then used in `fit` or in `split` as follows:: rng.set_state(self.random_state) # ... use rng as before -Since `self.random_state` is a tuple that never changes, calling `fit` or -`split` on the same instance always gives the same results. +Since `self.random_state` is an immutable tuple that never changes, calling +`fit` or `split` on the same instance always gives the same results. We want `__init__` and `set_params/get_params` to be consistent. To that end, we will need to special-case these methods:: @@ -293,7 +291,7 @@ Drawbacks: - There is a subtelty that occurs when passing `None`. `check_random_state` will return the singleton `np.random.mtrand._rand`, and we will call `get_state()` on the singleton. The thing is, its state won't change - accross multiple instances unless the singleton is consumed. So if we do + unless the singleton is consumed. So if we do `a = Est(random_state=None); b = Est(random_state=None)`, a and b actually have exactly the same `random_state` attribute, since the state of the singleton wasn't changed. To circumvent this, the logic in `__init__` and @@ -306,8 +304,8 @@ Drawbacks: - We need to store about 620 integers. This is however negligible w.r.t. e.g. the size of a typical dataset -- It does not fix the issue about third-party packages only accepting integers. - This can however be fixed in meta-estimator, independently. +- It does not fix the issue about third-party estimators only accepting + integers. This can however be fixed in each meta-estimator, independently. Alternative solutions ===================== @@ -334,10 +332,12 @@ Advantages: Drawbacks: -- We want that - `est.set_params(random_state=est.get_params()['random_state'])` does not - change `est.random_state`. With this logic, this is not possible to enforce. - Also, clone will not work was expected. +- Since we draw a seed in init (and in `set_params()`), `clone` will not + work as expected. 
+  est cannot have the same `random_state` attribute. This is the same for
+  `my_clone.set_params(random_state=est.get_params()['random_state'])`. The
+  seed will have to be drawn in `set_params`, thus leading to a different
+  `random_state` attribute.
 
 - It is not backward compatible between versions. For example if you passed
   an int in version A (say 0.22), then in version B (with the new
   behaviour), your estimator will not start with the same RNG when `fit` is
   called the first time. Same for splitters.
 
@@ -359,27 +359,38 @@
 the first time `fit()` is called. For example::
 
     def fit(self):  # or split()
        self._random_state = getattr(self, '_random_state',
                                     check_random_state(self.random_state).get_state())
        rng = np.random.RandomState()
        rng.set_state(self._random_state)
        # ...
 
 The advantage is that we respect our convention with `__init__`.
 
 However, `fit` idempotency isn't respected anymore: the first call to `fit`
-clearly influences all the other ones. This also introduces a private
-attribute, so we would need more intrusive changes to `set_params`,
-`get_params`, and `clone`.
+clearly influences all the other ones.
+
+This also introduces a private attribute, so we would need more intrusive
+changes to `set_params`, `get_params`, and `clone`.
+
+Execution and considerations
+============================
+
+Making the estimator stateless can be considered a bug fix. However, we are
+clearly changing the behaviour of the splitters, and some users may rely on
+the current behaviour, `to implement e.g. bootstrapping
+`_.
+If we make the splitters stateless, the "old" behaviour can be easily
+reproduced by simply creating new CV instances, instead of calling `split`
+on the same instance. Instances are cheap to create.
 
-Execution
-=========
+We would need a lot of outreach before introducing that change to let users
+know about it. And depending on how comfortable we are with it, this might
+be a 1.0 thing.
 
-That change might be a 1.0 thing.
 
-References and Footnotes
-------------------------
+.. References and Footnotes
+.. ------------------------
 
-.. [1] Each SLEP must either be explicitly labeled as placed in the public
-   domain (see this SLEP as an example) or licensed under the `Open
-   Publication License`_.
+.. .. [1] Each SLEP must either be explicitly labeled as placed in the public
+..    domain (see this SLEP as an example) or licensed under the `Open
..    Publication License`_.
 
-.. _Open Publication License: https://www.opencontent.org/openpub/
+.. .. _Open Publication License: https://www.opencontent.org/openpub/
 
 
-Copyright
----------
+.. Copyright
+.. ---------
 
-This document has been placed in the public domain. [1]_
+.. This document has been placed in the public domain. [1]_

From 47011fa1795a164ba25c844b6637358258c72394 Mon Sep 17 00:00:00 2001
From: Nicolas Hug
Date: Wed, 27 Nov 2019 08:29:49 -0500
Subject: [PATCH 4/9] title

---
 slep011/proposal.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/slep011/proposal.rst b/slep011/proposal.rst
index 6a311ec..af81a2c 100644
--- a/slep011/proposal.rst
+++ b/slep011/proposal.rst
@@ -1,8 +1,8 @@
 .. _slep_011:
 
-====================================
-SLEP011: Fixing randomness behaviour
-====================================
+===================================
+SLEP011: Fixing randomness handling
+===================================
 
 :Author: Nicolas Hug
 :Status: Under review

From e3f02ac408d7f577e7d8a828e35bf686cfbc9c90 Mon Sep 17 00:00:00 2001
From: Nicolas Hug
Date: Wed, 27 Nov 2019 08:36:06 -0500
Subject: [PATCH 5/9] some more

---
 slep011/proposal.rst | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/slep011/proposal.rst b/slep011/proposal.rst
index af81a2c..93c8547 100644
--- a/slep011/proposal.rst
+++ b/slep011/proposal.rst
@@ -264,6 +264,10 @@ Advantages:
   the same instance gives the same splits. In other words, it does what we
   want.
 
+- The behaviour is clear and intuitive: the object is fully defined at init,
+  and only at init. Things that happen between init and fit *do not* influence
+  the state of the object.
+
 - It is relatively simple to implement, and not too intrusive.

From 18a0f251bce62eb3d5dffddc3d810a67ee2e14d8 Mon Sep 17 00:00:00 2001
From: Nicolas Hug
Date: Wed, 11 Dec 2019 08:33:31 -0500
Subject: [PATCH 6/9] Added abstract

---
 slep011/proposal.rst | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/slep011/proposal.rst b/slep011/proposal.rst
index 93c8547..9baa7cd 100644
--- a/slep011/proposal.rst
+++ b/slep011/proposal.rst
@@ -9,8 +9,23 @@
 :Type: Standards Track
 :Created: 2019-11-27
 
-How we currently handle randomness
-==================================
+Abstract
+========
+
+This SLEP aims at fixing the issues related to how scikit-learn handles
+randomness of estimators and CV splitters.
+
+The proposed solution is to make estimators and splitters stateless, by
+storing the state of the `random_state` parameter that is passed in
+`__init__`.
+
+More than anything, this SLEP's goal is to *inform* the discussions related
+to randomness handling: if we end up going with the status quo (i.e. keep
+estimators and splitters stateful), then at least we are all aware of the
+price we're paying.
+
+Background: How we currently handle randomness
+==============================================
 
 `random_state` parameters are commonly used in estimators and in CV
 splitters. They can be either int, RandomState instances, or None. The

From 18ed43c438294eea4989bc9178d4d08f09798f44 Mon Sep 17 00:00:00 2001
From: Nicolas Hug
Date: Fri, 4 Sep 2020 17:49:23 -0400
Subject: [PATCH 7/9] Added note about SH forbidding stateful splitters

---
 slep011/proposal.rst | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/slep011/proposal.rst b/slep011/proposal.rst
index 9baa7cd..73210e3 100644
--- a/slep011/proposal.rst
+++ b/slep011/proposal.rst
@@ -216,7 +216,10 @@
 Depending on the intended behaviour of the parameter search, this may or may
 not be a good thing. This is typically a bug if we implement successive
 halving + warm start (details omitted here, you may refer to `this issue
 `_ for some more
-details).
+details). Currently, the `Successive Halving implementation
+`_ **forbids users
+from using stateful splitters**, e.g. `KFold(5, shuffle=True,
+random_state=None)` is forbidden.
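+
+To see why, note that such a splitter yields new splits on each call (a
+sketch, reusing the toy `X`, `y` notation from the snippets above)::
+
+    cv = KFold(5, shuffle=True, random_state=None)
+    first = [test for _, test in cv.split(X, y)]
+    second = [test for _, test in cv.split(X, y)]
+    # first != second: successive calls to evaluate_candidates() would
+    # score the candidates on different splits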
 
 Proposed Solution
 =================

From c3f54959013ff083ccc233f6d05b5d38016c7b4c Mon Sep 17 00:00:00 2001
From: Nicolas Hug
Date: Fri, 4 Sep 2020 18:15:53 -0400
Subject: [PATCH 8/9] Update slep011/proposal.rst

Co-authored-by: Adrin Jalali

---
 slep011/proposal.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/slep011/proposal.rst b/slep011/proposal.rst
index 73210e3..bacd7bf 100644
--- a/slep011/proposal.rst
+++ b/slep011/proposal.rst
@@ -116,7 +116,7 @@
 the subset to be the same for all calls to `fit` in this case. The fix was
 to store a private seed that was generated the first time `fit` is called.
 
 This is a typical example of many other similar bugs that we need to
-monkey-patch with potentially overly complex logic.
+patch with potentially overly complex logic.
 
 Cloning may return a different estimator
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From 907672c65df2e1db27fa72a62a6e3bab148c5ed1 Mon Sep 17 00:00:00 2001
From: Nicolas Hug
Date: Mon, 7 Sep 2020 10:01:49 -0400
Subject: [PATCH 9/9] some important details after chat with Alex

---
 slep011/proposal.rst | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/slep011/proposal.rst b/slep011/proposal.rst
index 73210e3..5bee03a 100644
--- a/slep011/proposal.rst
+++ b/slep011/proposal.rst
@@ -167,7 +167,19 @@ CV-splitters are stateful::
 `a` and `b` are different splits, because of how `split` is implemented (see
 introduction above).
 
-This behaviour is inconsistent for two reasons.
+This means that the following code is very likely incorrect::
+
+    rng = np.random.RandomState(0)
+    cv = KFold(shuffle=True, random_state=rng)
+    estimators = [...]  # the estimators you want to compare
+    for est in estimators:
+        cross_val_score(est, X, y, cv=cv)
+
+Users might not realize it, but **the estimators will be evaluated on
+different splits**, even though they think they've set the random state by
+passing a carefully crafted instance. They should have passed an int.
+
+The behaviour of the splitters is inconsistent for two reasons.
 
 The first one is that if `rng` were an int, then `a` and `b` would have been
 equal. As a result, the behaviour of the CV splitter depends on the
@@ -216,7 +228,8 @@
 Depending on the intended behaviour of the parameter search, this may or may
 not be a good thing. This is typically a bug if we implement successive
 halving + warm start (details omitted here, you may refer to `this issue
 `_ for some more
-details). Currently, the `Successive Halving implementation
+details). Currently, in order to prevent any potential bug and misconception
+from users, the `Successive Halving implementation
 `_ **forbids users
 from using stateful splitters**, e.g. `KFold(5, shuffle=True,
 random_state=None)` is forbidden.
 
 Proposed Solution
 =================
 
+.. note::
+    This proposed solution is a work in progress and there is room for
+    improvement.
+
 We need a solution that fixes the statefulness of the estimators and the
 splitters. Most of the remaining issues would be fixed as a consequence.
 
 A toy example of the proposed solution is implemented in this `notebook
@@ -332,6 +349,13 @@ Alternative solutions
 =====================
 
+Change the default of all random_state from `None` to a hardcoded int
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This doesn't solve much: it might limit pitfalls in users' code, but does not
+address the core issue, which is the statefulness of splitters and
+estimators.
+
 Store a seed instead of a state
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~