Version 1.0 of scikit-learn #14386
There's a milestone. Personally, I think #7242 and #10603 need to be fixed. I know some other people, including @adrinjalali and @GaelVaroquaux, feel strongly about #4497 and #4143. As you can see from the numbers, these issues are quite old, and there is no consensus yet on how to address them. We have delayed 1.0 to allow a breaking change to fix these issues; whether this is (still) a good strategy is debatable. We very recently introduced a governance document, a roadmap and an enhancement proposal formalism. These have actually allowed us to discuss some of the longstanding issues in a more productive way. We could decide to postpone some of the issues, make a polished 1.0 and then address them in 2.0. There are actually two separate things we might desire from a 1.0: stable interfaces, and reliable implementations. So far most of our discussion has been around having the right interfaces, but there are also issues with our implementations. There are issues in …; I would at least like to resolve the issues in …. I'm not sure if writing a 1.0 paper is helpful, but it's something to consider. |
Cool, I missed the 1.0 milestone - let's see if I can contribute :-) It's awesome to see that this is already in progress. scikit-learn is a project that helped me a lot during my studies / career; I will try to find some time to give something back.
Personally, I would consider this the "cherry on top": very nice to have, a very rewarding thing to do, probably less useful than many (all?) other things in the issue list. And also something that can be done at any point in time. I'm not sure if this "issue" should be closed then. Maybe it is a good way to channel comments / suggestions? |
One of the issues with adding additional papers is that it becomes less clear for users what to cite, and it splits our citation count. I think having an issue to discuss 1.0 is not a bad idea, so it's fine to leave this open as a central place for discussion. |
Since this came up again today: I'm a bit torn between wanting to have something I'm really happy with and getting a 1.0 out the door. I don't think the wish-list items will be done for the next release (currently called 0.22), and there's maybe a slight chance they will be done for the one after that. If we want 1.0 to be stable in some sense, then we would really need to prioritize those issues, which we haven't done so far (from what I can tell). |
We've certainly got enough content and enough quality assurance tools to suggest that we can be 1.0. If we're aiming for 1.0 we should work out what we want to include, focusing, I think, more on consistency than features. 1.0 might for instance be a good opportunity to improve parameter name/definition consistency, scale (and sample weight) invariance in parameter definitions, etc. FWIW, some of the changes around sample props may be best done with backwards incompatibility. The change to NamedArray may also introduce backwards incompatibility that would deserve a major release. But, indeed, there would be no great harm if that major release was 1.x to 2.x rather than 0.x to 1.x. |
+1 for moving to 1.0 soonish.
Should we do the NamedArray before, though?
|
Looking over the issues mentioned by @amueller in July, I wouldn't be concerned about 7242. Ensuring that the columns used for training / testing / inference are consistent is pretty basic. Regarding 10603, that is a valid point, and I think it should be true for a 1.0 release. Issue 4497 seems more like something that should not hold up a 1.0 release, while I do think 4143 is important enough that I'd like to see it in 1.0. With the prevalence of pandas, I do have to say that named features are probably important enough to ensure they're in a 1.0 release. |
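To make the column-consistency point (#7242) concrete, here is a minimal sketch, assuming the behavior that later landed around 1.0, where estimators fitted on a DataFrame record `feature_names_in_` and subsequent releases flag mismatched columns at predict time (a warning at first, an error in later versions):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

X_train = pd.DataFrame({"age": [25, 32, 47, 51], "income": [40, 55, 80, 62]})
y_train = [0, 1, 1, 0]

clf = LogisticRegression().fit(X_train, y_train)
print(clf.feature_names_in_)  # ['age' 'income'], remembered from fit

# Reordering columns at predict time is exactly the silent-failure mode the
# issue describes; newer releases detect the mismatch instead of mispredicting.
X_test = X_train[["income", "age"]]
try:
    print(clf.predict(X_test))
except ValueError as exc:
    print("caught:", exc)
```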
Another feature I'd personally like to see before 1.0 is native support for categorical data (in tree models, or at least some of them). Which is sort of a prerequisite for @amueller's #10603. And also make an informed decision on the randomness handling scikit-learn/enhancement_proposals#24 I agree with most of what has been said and I'm very happy to start considering 1.0 right now. Let's bring up the 1.0 topic during the next meeting so we can start figuring out what could / should be in there |
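For reference, a minimal sketch of what such native categorical support looks like in the histogram-based gradient boosting models, where it eventually landed as the `categorical_features` parameter (assuming a release where it is available, roughly 0.24+):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

# Column 0 holds categories encoded as non-negative integers; column 1 is numeric.
X = np.array([[0, 1.5], [1, 2.0], [2, 0.5], [0, 3.0], [1, 1.0], [2, 2.5]])
y = [0, 1, 1, 0, 1, 0]

# The model treats column 0 natively as categorical: no one-hot encoding,
# and tree splits can group arbitrary subsets of categories.
clf = HistGradientBoostingClassifier(categorical_features=[0]).fit(X, y)
print(clf.predict(X))
```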
+1 to release 1.0 ASAP, two questions: |
(1) ideally these would be stable by then IMO |
#4143 (transforming y) is always *possible* already with an appropriate meta-estimator
designed for a specific use-case (and the resampling components mostly just
need decisions, although there are open questions about handling props
aside from X and y), while #4497 (sample props) is more or less impossible
for a user to achieve without rewriting our model selection tools.
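As a concrete illustration of the meta-estimator route just described, a minimal sketch using the existing sklearn.compose.TransformedTargetRegressor (the log/exp transform here is purely illustrative):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(50, 2)
y = np.exp(X @ np.array([1.0, 2.0]))  # strictly positive target

# Ridge is fit on log1p(y); predictions are mapped back through expm1.
reg = TransformedTargetRegressor(regressor=Ridge(),
                                 func=np.log1p, inverse_func=np.expm1)
reg.fit(X, y)
print(reg.predict(X[:3]))
```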
#7242 should be doable by the next release.
#10603 has come a long way, but better handling of feature names would be
good either for v1 or v2.
(2) We mention things like "XXX is deprecated in 0.22 and will be removed in 0.24" so we promise that there will be 0.24?
I don't think that's a problem. There are lots of valid solutions, but
apart from anything else those messages are entirely about ensuring some
local backwards compatibility *within* a major version. Once we jump to 1.0
we can make whatever choices we like (within reasonable risk).
|
I would like to second the proposal for a version 1.0 paper, as publications are still an essential cornerstone of the academic world. As a PhD student considering an academic career, and a non-core developer of scikit-learn, my contributions currently work like this:
If there was a clear commitment to a publication, I would have leverage in discussions with my supervisor/faculty about allocating more time towards contributing to scikit-learn. I imagine other contributors are in similar situations.
I think these issues can be addressed. In my field (computational biology), papers about public resources are often updated every few years, i.e. there might be "The XY database in 2017", "in 2019", etc. One typically cites the latest iteration/highest version, which could be easily provided at https://scikit-learn.org/stable/about.html#citing-scikit-learn. |
I would propose to release 0.25 as 1.0 and be done with it. One year after this discussion was started we did move forward on some of those major points, but I can't say we are close to resolving all of them either. We can very well do a 2.0 for them if needed; otherwise the risk is that we will never release 1.0 (or at least not in the next few years), which is not great. Semantic versioning specifies that version 1.0 should happen when the software is used in production and there is a stable API. We have had that for a very long time, and this waiting for 1.0 until everything is solved can slightly hurt scikit-learn's image for users who implicitly expect it to follow semantic versioning. |
I think that we should release 0.25 to stay consistent with the deprecations made in 0.23 (e.g. https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_export.py#L138-L143). Nevertheless, I will be happy to see 0.26 released as 1.0. WDYT? |
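For context, a minimal sketch of the rolling deprecation cycle being discussed (hypothetical function and parameter names, only mirroring the pattern in the linked _export.py): a parameter is kept, with a warning, for two releases before removal, which is why deprecations started in 0.23 point at 0.25.

```python
import warnings

def plot_tree(tree, precision=3, rotate="deprecated"):
    # 'rotate' was deprecated in 0.23; per the warning's contract it must
    # survive, warning included, until 0.25 actually removes it.
    if rotate != "deprecated":
        warnings.warn(
            "'rotate' is deprecated in 0.23 and will be removed in 0.25.",
            FutureWarning,
        )
    # ... actual plotting logic elided ...
```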
@alfaro96 there are always rolling deprecations, and if we go with that, we can never release version 1.0 :D @rth I'd be happy to go with version 1.0, especially if that allows us to be backward incompatible w.r.t. sample props (cc @jnothman ). The solution we have right now is mostly backward compatible, except for some edge cases; it'd be nice if we could just not be backward compatible. |
Maybe we should start to make a list of backward-incompatible stuff that should be reasonably easy to get in within 0.25 (maybe stretch to 0.26). On top of sample props, #15880 would be a good candidate IMHO. @rth let's discuss this during the next meeting? https://hackmd.io/AuqfmgwvTf-bFz60yjVG1g |
Another one is fixing the loss names inconsistencies #18248 |
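To illustrate the inconsistency: the same concept was spelled differently across estimators (real historical parameter values, assuming a pre-unification release where they are still accepted; they later converged on "squared_error"):

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import SGDRegressor

# Three spellings of the same loss, pre-unification:
GradientBoostingRegressor(loss="ls")      # "least squares"
RandomForestRegressor(criterion="mse")    # "mean squared error"
SGDRegressor(loss="squared_loss")
```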
Coming up with a plotting API for scoring and metrics interfaces: #15880 |
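A minimal sketch of the kind of plotting API discussed there, using the `from_estimator` constructors on the Display classes (assuming a release where they exist, roughly 1.0+):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import RocCurveDisplay

X, y = make_classification(random_state=0)
clf = LogisticRegression().fit(X, y)

# Computes the ROC curve and draws it on a Matplotlib Axes in one call.
display = RocCurveDisplay.from_estimator(clf, X, y)
print(display.roc_auc)
```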
I'm fine¹ with releasing one of the next 0.2X, X≥5, as 1.0. ¹ Emotionally, just "fine" is an understatement 🎉 |
Aren't we already using SemVer? I think the next major release would be 2.0 (with breaking changes) and we would be releasing 1.1, 1.2, etc. every 6 months |
I think that technically we are not using SemVer, since minor releases are
not backwards compatible with all minor releases from the same major
version series.
|
There is an exception for 0.X versions though, where "anything can change at any time". So technically we could release 1.0 and claim to follow SemVer, though then we would have to increment the major version for any breaking change. On one side, it's a shame that the scientific Python ecosystem doesn't follow SemVer. Personal story from today: I had package v3.X installed, and a dependency required version v2.0.0 (v2.0 to v2.8 exist); what's the latest version of the package that would work with the dependency? Unknown. v2.8 doesn't work. I had to bisect all the way to v2.3. Without SemVer it's hard to know what works with what. On the other side, I have a hard time seeing how we wouldn't end up at version e.g. 16.0.0 with SemVer after a few years, since each release has some breaking changes even if they are preceded by a deprecation window. Maybe it's more suitable for smaller libraries, not sure. To be clear, I'm not proposing to follow SemVer, just wondering about it. |
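To make the "what works with what" point concrete, a small sketch using the real `packaging` library: under strict SemVer a downstream project could pin a compatible range and trust it.

```python
from packaging.specifiers import SpecifierSet

# Under SemVer, ">=1.0,<2" promises: any 1.x release is backward compatible.
compatible = SpecifierSet(">=1.0,<2")
print("1.4.2" in compatible)  # True  -- safe to install under SemVer
print("2.0.0" in compatible)  # False -- may contain breaking changes

# Without SemVer the range still filters versions, but the compatibility
# promise behind it is gone -- hence the bisecting story above.
```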
My preference would be to do a standard release of 0.25 as 1.0, as we would have done anyway (keeping standard deprecations), but reserve the right to do breaking changes for future major versions. So 1.0 wouldn't be very special, but we establish that 2.0 could have breaking changes. Breaking changes that might be interesting in the future:
• make pipeline clone
• allow .fit.transform != fit_transform (not technically an incompatible change)
|
Is this a problem? I know a couple of projects with very high version numbers ... starting with those, that use calendar versioning (CalVer) 😄 |
Breaking changes that might be interesting in the future
• make pipeline clone
• allow .fit.transform != fit_transform (not technically an incompatible change)
I'm +1 with both of these :)
|
• allow .fit.transform != fit_transform (not technically an incompatible change)
I will not fight for this anymore. I am getting softer with age :)
|
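For readers wondering why one would ever want `fit(X).transform(X)` to differ from `fit_transform(X)`: a minimal sketch (hypothetical class, not scikit-learn API) where `fit_transform` cross-fits to avoid leaking a sample's own target into its encoding, a pattern target encoders use.

```python
import numpy as np
from sklearn.model_selection import KFold

class CrossFittedMeanEncoder:
    """Encodes a 1-D categorical array by the per-category mean of y."""

    def fit(self, X, y):
        y = np.asarray(y, dtype=float)
        self.global_mean_ = y.mean()
        self.means_ = {c: y[np.asarray(X) == c].mean() for c in set(X)}
        return self

    def transform(self, X):
        # Uses statistics estimated on the *full* training set.
        return np.array([self.means_.get(x, self.global_mean_) for x in X])

    def fit_transform(self, X, y):
        # Cross-fitting: each fold is encoded with statistics from the other
        # fold only, so the result differs from fit(X).transform(X).
        X, y = np.asarray(X), np.asarray(y, dtype=float)
        out = np.empty(len(X))
        for train, test in KFold(n_splits=2, shuffle=True,
                                 random_state=0).split(X):
            fold_enc = CrossFittedMeanEncoder().fit(X[train], y[train])
            out[test] = fold_enc.transform(X[test])
        self.fit(X, y)  # still fit on everything for later transform calls
        return out

X = ["a", "a", "b", "b", "a", "b"]
y = [1, 0, 1, 1, 1, 0]
print(CrossFittedMeanEncoder().fit(X, y).transform(X))  # full-data statistics
print(CrossFittedMeanEncoder().fit_transform(X, y))     # cross-fitted, differs
```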
I understand most of us seem to favor a backward-compatible 1.0, but these 2 points are still unclear to me:
Sorry, I guess some of you made these points during today's meeting, but I couldn't follow everybody's POV. |
IIUC, we are technically breaking strict backward compatibility in every minor release so far (after a deprecation cycle). With 1.0.0 we want to signal the maturity of scikit-learn. It will have some (deprecation cycled) breaking changes. |
The main point of version 1.0 for me is to show that scikit-learn is production-ready. I would like to get this signal soon and having not too many changes makes it more likely that it happens soon :-) |
@NicolasHug, all, I think it's time to bring this issue to a pragmatic discussion: there is a milestone for version 1.0; how many issues listed there (and those you might want to add to the list) will break backward compatibility? Are they supposed to be solved for 1.0 (~0.25 in an ideal world)? If the work is doable for 0.25, let's break things; if not, let's keep the 'mess' for 2.0... scikit-learn needs a 1.0… :) |
I have created a new Breaking Change label to identify problematic issues that could not easily be managed by the usual rolling deprecation cycle. Please use it liberally to tag issues that would mandate breaking backward compat harder than usual; it would let us better decide whether switching to a new versioning scheme (without the leading "0.") would also require us to evolve our current deprecation policy. |
@cmarmo I understand that the timing constraint would not allow us to cram everything from the milestone into 1.0. Indeed we need to be pragmatic and my initial #14386 (comment) was the following, but perhaps I did a poor job at explaining this:
Typically, the random state issue that was mentioned during the meeting does not qualify as reasonably easy to get in. But most of the other things mentioned here in this thread do (pipeline clones, sample props (trusting @adrinjalali on this), the loss name unification...). I'm still interested in @amueller's, @GaelVaroquaux's and @ogrisel's answers to my questions above (#14386 (comment)). To clarify: I'm not trying to push for breaking things at all costs, and I'm fine with not breaking backward compatibility. But I'd like to understand the reasoning behind this decision. I don't understand it so far, and it makes me think that there's something obvious that I'm missing. |
I understand most of us seem to favor a backward-compatible 1.0, but these 2 points are still unclear to me:
I would rather avoid any breaking change if we can. Always following our deprecation cycle with warnings is nicer to our users.
I would rather not have any breaking change in 2.0 either. To me the point of dropping the leading 0 in our version number is just psychological/communication but it would not necessarily lead us to implement SemVer if the rolling deprecation policy can be preserved (just shifting what we consider major / bugfix releases by one digit to the left). |
Thanks @ogrisel, at least I understand your POV because it's consistent. But I still need to understand @amueller and @GaelVaroquaux 's reasoning then ;) |
I don't have a strong point of view here other than that we should limit breakage. |
@GaelVaroquaux so would you be OK with introducing a few minor breaking changes in 1.0, provided that the next breaking changes would happen many years from now (if ever)? I have the impression that during the last meeting, people understood my point of view as "let's make breaking changes often" but that's absolutely not what I'm advocating for. |
Not very happy. Actually, I'd rather have the breaking changes in a v0.99. Also, I'd like to be convinced that there is no way to achieve a smooth deprecation. |
As long as no list of breaking features is available, no evaluation is possible: @ogrisel kindly provided the suitable label, which hasn't been used so far. |
There are a bunch of comments with such issues in this thread already. #14386 (comment) #14386 (comment) #14386 (comment) #14386 (comment) |
BTW there is some relevant discussion about versioning models for numpy numpy/numpy#10156 and there is also NEP 23 which formalizes some of it.
Technically we have breaking changes in each release, at least for some infrequent use cases (e.g. removal of a parameter) after a deprecation cycle. I don't think that would change after 1.0 unless we want to significantly change our deprecation policy? Though I agree that we should try to minimize the amount of breaking changes as much as possible, including for v1.0. |
Technically we have breaking changes in each release, at least for some infrequent use cases (e.g. removal of a parameter) after a deprecation cycle.
Yes, I think that the level of breakage that we have is acceptable: users do not seem to be anxious to update.
|
Version 1.0 was released recently. On the way, it was good to have this issue for discussions and raising concerns and questions. A big thank you to everyone who participated! |
I just realized (by looking at 0ver.org) that scikit-learn is also at version 0.x. I could not find any discussion about version 1.0 in the issues.
I would like to understand the reasoning / see if there is any other channel where this topic is discussed.
Why it matters
Semantic Versioning is widespread. People who are new to Python still know (parts of) semantic versioning. Having software at a 0.x version makes it feel as if the software is brittle and prone to breaking changes.
scikit-learn does not use any of the Development Status :: trove classifiers (setup.py, list of trove classifiers). Although I guess anybody working with Python has heard of scikit-learn, it might be hard for a newcomer to quickly assess the maturity of the project. An alternative is calendar-based versioning.
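For illustration, a hypothetical setup.py fragment declaring maturity through a trove classifier (the classifier strings are real PyPI values; the surrounding metadata is made up):

```python
from setuptools import setup

setup(
    name="example-project",
    version="1.0.0",
    # Declares maturity on PyPI; scikit-learn's setup.py set none of these.
    classifiers=[
        "Development Status :: 5 - Production/Stable",
        "Programming Language :: Python :: 3",
    ],
)
```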
Why scikit-learn should be 1.0
The Process to get to 1.0
scipy handled this really nicely. I guess some of the developers there also keep an eye on scikit-learn, so I hope to get more details.
From my perspective, it looked as if the scipy community took the following steps to get to 1.0: