Multiclass LightGBM bug #3878

Closed
yaeldekel opened this issue Jun 18, 2019 · 3 comments · Fixed by #4608
Labels
P0 Priority of the issue for triage purpose: IMPORTANT, needs to be fixed right away.

Comments

@yaeldekel

The LightGBM trainer has two non-readonly fields, _numClass and _tlcNumClass. The second determines the number of predictors in the OVA (one-versus-all) predictor. However, _tlcNumClass is only set once, so if Fit is called again on the same estimator, the resulting model may have the wrong number of classes.
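
The failure mode can be sketched in a few lines of Python. This illustrates the pattern only and is not ML.NET's code: `CachingTrainer`, `fit`, and `_num_classes` are hypothetical stand-ins for the trainer, `Fit`, and `_tlcNumClass`.

```python
class CachingTrainer:
    """Hypothetical trainer that caches the class count on the first fit only."""

    def __init__(self):
        self._num_classes = None  # stand-in for _tlcNumClass (assumed name)

    def fit(self, labels):
        # Bug: the count is computed only while the field is still unset,
        # so a second fit on different data reuses the stale value.
        if self._num_classes is None:
            self._num_classes = len(set(labels))
        # One binary predictor per class, as in a one-versus-all model.
        return [f"predictor_{i}" for i in range(self._num_classes)]


trainer = CachingTrainer()
first = trainer.fit([1, 2, 3, 1, 2, 3])
second = trainer.fit([1, 2, 3, 4, 1, 2])  # four distinct labels this time
print(len(first), len(second))  # prints "3 3": the second fit misses a class
```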

yaeldekel added the P0 label Jun 18, 2019
wschin self-assigned this Jun 27, 2019
antoniovs1029 self-assigned this Dec 11, 2019
@antoniovs1029
Member

antoniovs1029 commented Dec 11, 2019

I've just taken a quick look at this. To reproduce it, I modified the Multiclass LightGbm sample and added code to fit the pipeline a second time with a dataview that has labels 1-4 (the dataview used for the first fit had labels 1-3). On the second fit, _tlcNumClass was still 3 (instead of 4), and when printing the metrics, only 3 labels were taken into account; the last one was ignored.

I believe this is the issue @yaeldekel is describing, right?

If this is the issue, wouldn't this be solved by simply moving the initialization of _tlcNumClass out of this if statement, so that it would always be initialized when fitting the pipeline?
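
A sketch of that proposal, in Python with hypothetical names (`Trainer`, `fit`, `_num_classes`) rather than the actual ML.NET code: the class count is recomputed unconditionally at the start of every fit, so it always reflects the data passed to that call.

```python
class Trainer:
    """Hypothetical trainer that resets the class count on every fit."""

    def __init__(self):
        self._num_classes = None  # stand-in for _tlcNumClass (assumed name)

    def fit(self, labels):
        # Recompute unconditionally instead of guarding with an if statement.
        self._num_classes = len(set(labels))
        # One binary predictor per class, as in a one-versus-all model.
        return [f"predictor_{i}" for i in range(self._num_classes)]


trainer = Trainer()
models_3 = trainer.fit([1, 2, 3, 1, 2, 3])     # labels 1-3
models_4 = trainer.fit([1, 2, 3, 4, 1, 2, 3])  # labels 1-4 on the same trainer
print(len(models_3), len(models_4))  # prints "3 4"
```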

@justinormont
Contributor

What does calling Fit() twice do?

If it's not LightGBM's task=refit, can we expose it?

This may serve the needs of AutoML, which is looking for streamable trees. This would let us fit the tree structure on ~10 to 100GB of data, then stream the whole dataset (TBs) to refit the leaf node values. There's a similar option by using TreeFeat + linear model.
/cc @daholste
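
The idea of refitting leaf values on a fixed tree structure can be sketched as follows. This is a simplified illustration under stated assumptions, not LightGBM's refit implementation: the tree is a single hypothetical threshold split, each leaf value is the plain mean of the targets routed to it, and `batches` is made-up data.

```python
from collections import defaultdict


def leaf_index(x, threshold=0.5):
    """Fixed tree structure (assumed): a single split on one feature."""
    return 0 if x < threshold else 1


def refit_leaf_values(batches):
    """Stream batches of (x, y) pairs and recompute each leaf's mean target."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for batch in batches:
        for x, y in batch:
            leaf = leaf_index(x)
            sums[leaf] += y
            counts[leaf] += 1
    return {leaf: sums[leaf] / counts[leaf] for leaf in counts}


# Two small batches standing in for a dataset too large to hold in memory.
batches = [
    [(0.1, 1.0), (0.9, 5.0)],
    [(0.2, 3.0), (0.8, 7.0)],
]
leaf_values = refit_leaf_values(batches)
print(leaf_values)  # {0: 2.0, 1: 6.0}
```

Only per-leaf running sums and counts are kept, so memory use does not grow with the number of rows streamed.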

@yaeldekel
Author

The estimators are intended to be stateless, so calling Fit() twice should produce exactly the same result as defining two estimators and calling Fit() once on each of them (except, perhaps, for any randomness used during training).
Regarding LightGBM refit, is it capable of doing something that cannot be done using TreeFeat + linear model? If the answer is yes, could you open a new issue for it?
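
One way to express this contract is to keep only hyperparameters on the estimator and to return everything learned from data in the fitted model. The Python sketch below uses hypothetical names (`Estimator`, `Model`, `num_iterations`) and is not the ML.NET API; it checks that two fits on one estimator match one fit on each of two fresh estimators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """Everything learned from data lives here, not on the estimator."""
    num_classes: int


@dataclass(frozen=True)
class Estimator:
    """Holds hyperparameters only; frozen, so fit cannot mutate it."""
    num_iterations: int = 100  # assumed hyperparameter, unused in this sketch

    def fit(self, labels):
        return Model(num_classes=len(set(labels)))


data_a = [1, 2, 3]
data_b = [1, 2, 3, 4]

shared = Estimator()
reused = [shared.fit(data_a), shared.fit(data_b)]
fresh = [Estimator().fit(data_a), Estimator().fit(data_b)]
print(reused == fresh)  # True
```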

antoniovs1029 added a commit that referenced this issue Jan 7, 2020
…trainer. (#4608)

* Reset _numberOfClassesIncludingNan everytime the trainer is fitted.
* Renamed some variables and added comments to make the code more legible
* Other minor changes in LightGBM classes
@ghost ghost locked as resolved and limited conversation to collaborators Mar 21, 2022
4 participants