Update scikit learn 1.2 #1611
Conversation
Codecov Report
@@ Coverage Diff @@
## development #1611 +/- ##
===============================================
- Coverage 83.42% 83.32% -0.11%
===============================================
Files 156 156
Lines 11927 12298 +371
Branches 1896 2033 +137
===============================================
+ Hits 9950 10247 +297
- Misses 1412 1453 +41
- Partials 565 598 +33
I'm interested in this change. Is anything holding back this PR?
Any updates on this?
@eddiebergman can we help with this? Are the two steps below all that remain?
This PR attempts a clean update of scikit-learn to 1.2, which necessitates requiring Python 3.8. This means newer releases will not run on Google Colab, as Colab only supports Python 3.7.
This will be a live PR, working through the changelogs for 1.0.2 and the changelogs for 1.1.3.
Supposedly relevant Changelog entries
API Change The option for using the squared error via loss and criterion parameters was made more consistent. The preferred way is by setting the value to "squared_error". Old option names are still valid, produce the same models, but are deprecated and will be removed in version 1.2. #19310 by Christian Lorentzen.
For ensemble.ExtraTreesRegressor, criterion="mse" is deprecated, use "squared_error" instead which is now the default.
For ensemble.GradientBoostingRegressor, loss="ls" is deprecated, use "squared_error" instead which is now the default.
For ensemble.RandomForestRegressor, criterion="mse" is deprecated, use "squared_error" instead which is now the default.
For ensemble.HistGradientBoostingRegressor, loss="least_squares" is deprecated, use "squared_error" instead which is now the default.
For linear_model.RANSACRegressor, loss="squared_loss" is deprecated, use "squared_error" instead.
For linear_model.SGDRegressor, loss="squared_loss" is deprecated, use "squared_error" instead which is now the default.
For tree.DecisionTreeRegressor, criterion="mse" is deprecated, use "squared_error" instead which is now the default.
For tree.ExtraTreeRegressor, criterion="mse" is deprecated, use "squared_error" instead which is now the default.
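As a quick illustration of the rename (a sketch, not code from this PR's diff; the toy data is made up), the old and new spellings fit the same model, but the old one warns and is removed in 1.2:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=50, n_features=4, random_state=0)

# Deprecated spelling: warns under 1.0/1.1, removed in 1.2.
# RandomForestRegressor(criterion="mse", random_state=0).fit(X, y)

# New spelling, which is now also the default:
reg = RandomForestRegressor(criterion="squared_error", random_state=0).fit(X, y)
```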
API Change The option for using the absolute error via loss and criterion parameters was made more consistent. The preferred way is by setting the value to "absolute_error". Old option names are still valid, produce the same models, but are deprecated and will be removed in version 1.2. #19733 by Christian Lorentzen.
For ensemble.ExtraTreesRegressor, criterion="mae" is deprecated, use "absolute_error" instead.
For ensemble.GradientBoostingRegressor, loss="lad" is deprecated, use "absolute_error" instead.
For ensemble.RandomForestRegressor, criterion="mae" is deprecated, use "absolute_error" instead.
For ensemble.HistGradientBoostingRegressor, loss="least_absolute_deviation" is deprecated, use "absolute_error" instead.
For linear_model.RANSACRegressor, loss="absolute_loss" is deprecated, use "absolute_error" instead which is now the default.
For tree.DecisionTreeRegressor, criterion="mae" is deprecated, use "absolute_error" instead.
For tree.ExtraTreeRegressor, criterion="mae" is deprecated, use "absolute_error" instead.
API Change np.matrix usage is deprecated in 1.0 and will raise a TypeError in 1.2. #20165 by Thomas Fan.
API Change get_feature_names_out has been added to the transformer API to get the names of the output features. get_feature_names has in turn been deprecated. #18444 by Thomas Fan.
API Change All estimators store feature_names_in_ when fitted on pandas Dataframes. These feature names are compared to names seen in non-fit methods, e.g. transform and will raise a FutureWarning if they are not consistent. These FutureWarning s will become ValueError s in 1.2. #18010 by Thomas Fan.
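Rough sketch of what the two entries above look like in practice (the toy DataFrame and column names are illustrative only):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [20.0, 30.0, 40.0], "income": [1.0, 2.0, 3.0]})
scaler = StandardScaler().fit(df)

print(scaler.feature_names_in_)        # ['age' 'income']
print(scaler.get_feature_names_out())  # ['age' 'income']

# Transforming with column names that differ from those seen in fit raises a
# FutureWarning in 1.0/1.1 and is expected to become a ValueError in 1.2:
scaler.transform(df.rename(columns={"income": "salary"}))
```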
API Change Deprecates the following keys in cv_results_: 'mean_score', 'std_score', and 'split(k)_score' in favor of 'mean_test_score', 'std_test_score', and 'split(k)_test_score'. #20583 by Thomas Fan.
API Change Deprecates datasets.load_boston in 1.0 and it will be removed in 1.2. Alternative code snippets to load similar datasets are provided. Please refer to the docstring of the function for details. #20729 by Guillaume Lemaitre.
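For anywhere we still use load_boston, one of the alternatives pointed to by the deprecation notice is the California housing dataset; a minimal sketch (note it downloads and caches data on first use):

```python
from sklearn.datasets import fetch_california_housing

# Replacement for the removed load_boston; fetched and cached on first call.
X, y = fetch_california_housing(return_X_y=True)
print(X.shape, y.shape)  # (20640, 8) (20640,)
```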
API Change Rename variable names in KernelPCA to improve readability. lambdas_ and alphas_ are renamed to eigenvalues_ and eigenvectors_, respectively. lambdas_ and alphas_ are deprecated and will be removed in 1.2. #19908 by Kei Ishikawa.
API Change Attribute n_features_in_ in dummy.DummyClassifier and dummy.DummyRegressor is deprecated and will be removed in 1.2. #20960 by Thomas Fan.
Fix Fixed the range of the argument max_samples to be (0.0, 1.0] in ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, where max_samples=1.0 is interpreted as using all n_samples for bootstrapping. #20159 by @murata-yu.
API Change Removes tol=None option in ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor. Please use tol=0 for the same behavior. #19296 by Thomas Fan.
Fix Raise a warning in feature_extraction.text.CountVectorizer with lowercase=True when there are vocabulary entries with uppercase characters to avoid silent misses in the resulting feature vectors. #19401 by Zito Relova
API Change Raises an error in feature_selection.VarianceThreshold when the variance threshold is negative. #20207 by Tomohiro Endo
API Change Deprecates grid_scores_ in favor of split scores in cv_results_ in feature_selection.RFECV. grid_scores_ will be removed in version 1.2. #20161 by Shuhei Kayawari and @arka204.
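Migration sketch for any code reading grid_scores_, assuming the aggregated per-split test scores in cv_results_ are what we actually consume (toy data is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
rfecv = RFECV(LogisticRegression(max_iter=1000), cv=3).fit(X, y)

# Old (deprecated, removed in 1.2): rfecv.grid_scores_
mean_scores = rfecv.cv_results_["mean_test_score"]
```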
Enhancement Add max_samples parameter in inspection.permutation_importance. It enables drawing a subset of the samples to compute the permutation importance, which is useful to keep the method tractable when evaluating feature importance on large datasets. #20431 by Oliver Pfaffel.
Feature Added sample_weight parameter to linear_model.LassoCV and linear_model.ElasticNetCV. #16449 by Christian Lorentzen.
Feature Added new solver lbfgs (available with solver="lbfgs") and positive argument to linear_model.Ridge. When positive is set to True, forces the coefficients to be positive (only supported by lbfgs). #20231 by Toshihiro Nakae.
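Small sketch of the new option (synthetic data, illustrative only):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = X @ np.array([1.0, 2.0, 3.0])

# Non-negative coefficients are only supported by the new lbfgs solver.
ridge = Ridge(positive=True, solver="lbfgs").fit(X, y)
print(ridge.coef_)  # every entry >= 0
```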
Enhancement fit method preserves dtype for numpy.float32 in linear_model.Lars, linear_model.LassoLars, linear_model.LarsCV and linear_model.LassoLarsCV. #20155 by Takeshi Oura.
API Change: The parameter normalize of linear_model.LinearRegression is deprecated and will be removed in 1.2. Motivation for this deprecation: the normalize parameter had no effect if fit_intercept was set to False and was therefore deemed confusing. The behavior of the deprecated LinearModel(normalize=True) can be reproduced with a Pipeline with LinearModel (where LinearModel is LinearRegression, Ridge, RidgeClassifier, RidgeCV or RidgeClassifierCV) as follows: make_pipeline(StandardScaler(with_mean=False), LinearModel()). The normalize parameter in LinearRegression was deprecated in #17743 by Maria Telenczuk and Alexandre Gramfort. Same for Ridge, RidgeClassifier, RidgeCV, and RidgeClassifierCV, in: #17772 by Maria Telenczuk and Alexandre Gramfort. Same for BayesianRidge, ARDRegression in: #17746 by Maria Telenczuk. Same for Lasso, LassoCV, ElasticNet, ElasticNetCV, MultiTaskLasso, MultiTaskLassoCV, MultiTaskElasticNet, MultiTaskElasticNetCV, in: #17785 by Maria Telenczuk and Alexandre Gramfort.
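The replacement recipe from that deprecation note, spelled out (using Ridge as a stand-in for any of the affected LinearModel classes):

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Old (deprecated): Ridge(normalize=True)
# Equivalent per the deprecation note:
model = make_pipeline(StandardScaler(with_mean=False), Ridge())
```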
API Change Keyword validation has moved from init and set_params to fit for the following estimators conforming to scikit-learn’s conventions: SGDClassifier, SGDRegressor, SGDOneClassSVM, PassiveAggressiveClassifier, and PassiveAggressiveRegressor. #20683 by Guillaume Lemaitre.
Enhancement The model_selection.BaseShuffleSplit base class is now public. #20056 by @pabloduque0.
API Change The attribute sigma_ is now deprecated in naive_bayes.GaussianNB and will be removed in 1.2. Use var_ instead. #18842 by Hong Shao Yang.
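Trivial rename on our side if we touch it anywhere; a quick sketch:

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
gnb = GaussianNB().fit(X, y)

# Old (deprecated, removed in 1.2): gnb.sigma_
per_class_variances = gnb.var_  # shape (n_classes, n_features)
```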
Fix The preprocessing.StandardScaler.inverse_transform method now raises error when the input data is 1D. #19752 by Zhehao Liu.
Fix The fit method of preprocessing.OrdinalEncoder will no longer raise an error when handle_unknown='ignore' and unknown categories are given to fit. #19906 by Zhehao Liu.
API Change The n_input_features_ attribute of preprocessing.PolynomialFeatures is deprecated in favor of n_features_in_ and will be removed in 1.2. #20240 by Jérémie du Boisberranger.
API Change The n_features_ attribute of tree.DecisionTreeClassifier, tree.DecisionTreeRegressor, tree.ExtraTreeClassifier and tree.ExtraTreeRegressor is deprecated in favor of n_features_in_ and will be removed in 1.2. #20272 by Jérémie du Boisberranger.
Enhancement utils.validation.check_is_fitted now uses __sklearn_is_fitted__ if available, instead of checking for attributes ending with an underscore. This also makes pipeline.Pipeline and preprocessing.FunctionTransformer pass check_is_fitted(estimator). #20657 by Adrin Jalali.
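Sketch of the hook for any custom component that should pass check_is_fitted without trailing-underscore attributes (the class name below is made up for illustration):

```python
from sklearn.base import BaseEstimator
from sklearn.utils.validation import check_is_fitted

class AlwaysFitted(BaseEstimator):
    """Toy component that opts into the fitted check explicitly."""

    def __sklearn_is_fitted__(self):
        # check_is_fitted uses this hook when present, instead of scanning
        # for attributes ending in an underscore.
        return True

check_is_fitted(AlwaysFitted())  # no NotFittedError raised
```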
Fix Support for np.matrix is deprecated in check_array in 1.0 and will raise a TypeError in 1.2. #20165 by Thomas Fan.
Fix impute.SimpleImputer uses the dtype seen in fit for transform when the dtype is object. #22063 by Thomas Fan.
Enhancement Added an extension in doc/conf.py to automatically generate the list of estimators that handle NaN values. #23198 by Lise Kleiber, Zhehao Liu and Chiara Marmo.
Efficiency cluster.KMeans now defaults to algorithm="lloyd" instead of algorithm="auto", which was equivalent to algorithm="elkan". Lloyd’s algorithm and Elkan’s algorithm converge to the same solution, up to numerical rounding errors, but in general Lloyd’s algorithm uses much less memory, and it is often faster.
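If we construct KMeans with algorithm="auto" anywhere, the forward-compatible spelling looks like this (sketch, toy data only; n_init is pinned just to keep the example warning-free):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

# "auto" is deprecated; "lloyd" is the new default and behaves the same.
km = KMeans(n_clusters=3, algorithm="lloyd", n_init=10, random_state=0).fit(X)
```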
API Change The option for using the log loss, aka binomial or multinomial deviance, via the loss parameters was made more consistent. The preferred way is by setting the value to "log_loss". Old option names are still valid and produce the same models, but are deprecated and will be removed in version 1.3.
For ensemble.GradientBoostingClassifier, the loss parameter name “deviance” is deprecated in favor of the new name “log_loss”, which is now the default. #23036 by Christian Lorentzen.
For ensemble.HistGradientBoostingClassifier, the loss parameter names “auto”, “binary_crossentropy” and “categorical_crossentropy” are deprecated in favor of the new name “log_loss”, which is now the default. #23040 by Christian Lorentzen.
For linear_model.SGDClassifier, the loss parameter name “log” is deprecated in favor of the new name “log_loss”. #23046 by Christian Lorentzen.
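Sketch of the rename for the two estimators it is most likely to show up in for us (toy data is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=100, random_state=0)

# Old spellings ("log", "auto", "binary_crossentropy", ...) are deprecated:
sgd = SGDClassifier(loss="log_loss", random_state=0).fit(X, y)
hgb = HistGradientBoostingClassifier(loss="log_loss", random_state=0).fit(X, y)
```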
Major Feature Added additional option loss="quantile" to ensemble.HistGradientBoostingRegressor for modelling quantiles. The quantile level can be specified with the new parameter quantile. #21800 and #20567 by Christian Lorentzen.
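Sketch of the new quantile option (data and quantile level are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=200, random_state=0)

# Predicts the 90th percentile instead of the conditional mean.
q90 = HistGradientBoostingRegressor(loss="quantile", quantile=0.9).fit(X, y)
```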
Enhancement ensemble.RandomForestClassifier and ensemble.ExtraTreesClassifier have the new criterion="log_loss", which is equivalent to criterion="entropy". #23047 by Christian Lorentzen.
Enhancement Adds get_feature_names_out to ensemble.VotingClassifier, ensemble.VotingRegressor, ensemble.StackingClassifier, and ensemble.StackingRegressor. #22695 and #22697 by Thomas Fan.
API Change Changed the default of max_features to 1.0 for ensemble.RandomForestRegressor and to "sqrt" for ensemble.RandomForestClassifier. Note that these give the same fit results as before, but are much easier to understand. The old default value "auto" has been deprecated and will be removed in version 1.3. The same changes are also applied for ensemble.ExtraTreesRegressor and ensemble.ExtraTreesClassifier. #20803 by Brian Sun.
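If we pin max_features="auto" anywhere, making the equivalent values explicit keeps the same fits and silences the deprecation warning; a sketch:

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Explicit equivalents of the old "auto" default:
clf = RandomForestClassifier(max_features="sqrt")
reg = RandomForestRegressor(max_features=1.0)
```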
Fix predict and sample_y methods of gaussian_process.GaussianProcessRegressor now return arrays of the correct shape in single-target and multi-target cases, and for both normalize_y=False and normalize_y=True. #22199 by Guillaume Lemaitre, Aidar Shakerimoff and Tenavi Nakamura-Zimmerer.
Enhancement impute.SimpleImputer now warns with feature names when features are skipped due to the lack of any observed values in the training set. #21617 by Christian Ritter.
Enhancement Added support for pd.NA in impute.SimpleImputer. #21114 by Ying Xiong.
Enhancement Adds get_feature_names_out to impute.SimpleImputer, impute.KNNImputer, impute.IterativeImputer, and impute.MissingIndicator. #21078 by Thomas Fan.
API Change The verbose parameter was deprecated for impute.SimpleImputer. A warning will always be raised upon the removal of empty columns. #21448 by Oleh Kozynets and Christian Ritter.
Feature preprocessing.OneHotEncoder now supports grouping infrequent categories into a single feature. Grouping infrequent categories is enabled by specifying how to select infrequent categories with min_frequency or max_categories. #16018 by Thomas Fan.
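Sketch of the infrequent-category grouping (toy data; the threshold of 5 is arbitrary here):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# 10 cats, 10 dogs, 1 snake: with min_frequency=5 the rare category is
# collapsed into a single infrequent-category output column.
X = np.array([["cat"] * 10 + ["dog"] * 10 + ["snake"]]).T
enc = OneHotEncoder(min_frequency=5).fit(X)
print(enc.infrequent_categories_)  # [array(['snake'], ...)]
```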
Enhancement Adds encoded_missing_value to preprocessing.OrdinalEncoder to configure the encoded value for missing data. #21988 by Thomas Fan.
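Sketch of the new missing-value handling (toy data; the encoded value -1 is just an example):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X = np.array([["low"], ["high"], [np.nan]], dtype=object)

# Missing entries get their own code instead of staying NaN in the output.
enc = OrdinalEncoder(encoded_missing_value=-1).fit(X)
print(enc.transform(X).ravel())  # e.g. [1. 0. -1.]
```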
Enhancement svm.OneClassSVM, svm.NuSVC, svm.NuSVR, svm.SVC and svm.SVR now expose n_iter_, the number of iterations of the libsvm optimization routine. #21408 by Juan Martín Loyola.
Enhancement tree.DecisionTreeClassifier and tree.ExtraTreeClassifier have the new criterion="log_loss", which is equivalent to criterion="entropy". #23047 by Christian Lorentzen.
API Change Changed the default value of max_features to 1.0 for tree.ExtraTreeRegressor and to "sqrt" for tree.ExtraTreeClassifier, which will not change the fit result. The original default value "auto" has been deprecated and will be removed in version 1.3. Setting max_features to "auto" is also deprecated for tree.DecisionTreeClassifier and tree.DecisionTreeRegressor. #22476 by Zhehao Liu.
Write separate issue to update this. It will break a lot of our test fixtures.
Write separate issue to investigate this
Write separate issue to investigate this
Separate issue
n_iter_ added to BaseLibSVM (scikit-learn/scikit-learn#21408 by Juan Martín Loyola). Could these be turned into iterative fit methods? Separate issue.
Next steps