Description
In this SO thread, http://stackoverflow.com/questions/38079853/how-can-i-implement-incremental-training-for-xgboost, I could find some info about incremental training using the Python interface, but I was not able to find any info about such a feature in the Spark version. Is it available? Will it be in the future? Many thanks!
xydrolase commented on Dec 9, 2016
Internally there is an `update` function in `ml.dmlc.xgboost4j.scala.Booster`, which allows you to perform updates iteration by iteration, but there is currently no wrapper for incremental updates in xgboost4j-spark.

Are you envisioning a Spark Streaming application? If you're concerned about memory, you can try `useExternalMemory = true` when you train with Spark's XGBoost.
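For reference, a minimal single-machine sketch of that per-iteration `update` path, assuming the xgboost4j-scala API of that era (`DMatrix`, `XGBoost.train`, `Booster.update`); the file paths, parameter values, and the 50/50 iteration split are illustrative assumptions, not part of the original comment:

```scala
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}

// Two hypothetical LibSVM files standing in for successive data batches.
val batch1 = new DMatrix("data/batch1.libsvm")
val batch2 = new DMatrix("data/batch2.libsvm")

val params = Map(
  "eta" -> 0.1,
  "max_depth" -> 6,
  "objective" -> "binary:logistic"
)

// Train the first 50 trees on the first batch in one call, as usual.
val booster = XGBoost.train(batch1, params, 50)

// Each update() call grows the existing model by one boosting
// iteration; here the second batch drives another 50 rounds.
for (iter <- 50 until 100) {
  booster.update(batch2, iter)
}

booster.saveModel("model/incremental.bin")
```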
fc1plusx commented on Dec 9, 2016

No, a standard Spark application. But it would be better for my experimentation to train the classifier incrementally with the new data that I get from time to time. Any plans for this feature in the future? Thanks!
xydrolase commented on Dec 9, 2016
From a technical point of view, it is certainly doable, perhaps with an interface similar to the Python one. But could you elaborate a bit more on the use case?
So you have, say, 50 million samples today, and you train a model with 50 trees.
A week later, you have another 50 million samples, and you want to update the model for another 50 iterations?
fc1plusx commented on Dec 9, 2016
Exactly. To elaborate: I would like to update the previous model with the new data that I get. I think it would be interesting to see how the model performs when trained incrementally versus re-trained on the full updated dataset (i.e. train with data1 and update with data2, versus train with data1 and then train again with data1 + data2). Also, from a computational point of view, incremental training should be less expensive than re-training on the updated dataset.
I'm not aware of the XGBoost internals, but if I understand correctly, the GBT algorithm is inherently incremental, yet there are quite a lot of optimisations in XGBoost that require all the data, making it something of a hybrid incremental/batch classification and regression framework. Am I right? In that case, is incremental training still possible?
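To make the proposed comparison concrete, here is a rough sketch under the same xgboost4j-scala assumptions as the earlier example; all paths and parameter values are placeholders, labels are assumed binary, and `errorRate` is a hypothetical helper, not a library function:

```scala
import ml.dmlc.xgboost4j.scala.{Booster, DMatrix, XGBoost}

val data1   = new DMatrix("data/data1.libsvm")        // first batch
val data2   = new DMatrix("data/data2.libsvm")        // new batch
val data12  = new DMatrix("data/data1_plus_2.libsvm") // both combined
val holdout = new DMatrix("data/holdout.libsvm")      // fixed test set

val params = Map("eta" -> 0.1, "max_depth" -> 6,
  "objective" -> "binary:logistic")

// Model A: train on data1, then continue boosting on data2 only.
val incremental = XGBoost.train(data1, params, 50)
for (iter <- 50 until 100) incremental.update(data2, iter)

// Model B: retrain from scratch on the full updated dataset.
val retrained = XGBoost.train(data12, params, 100)

// Hypothetical helper: 0/1 error on a binary-labeled holdout.
def errorRate(b: Booster, d: DMatrix): Double = {
  val preds  = b.predict(d).map(_(0))
  val labels = d.getLabel
  val wrong  = preds.zip(labels).count { case (p, y) =>
    (if (p > 0.5f) 1.0f else 0.0f) != y
  }
  wrong.toDouble / labels.length
}

println(s"incremental: ${errorRate(incremental, holdout)}")
println(s"retrained:   ${errorRate(retrained, holdout)}")
```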
xydrolase commented on Dec 9, 2016
As I mentioned earlier, internally every iteration is trained by the `update()` function, so training an entire model can be decomposed into multiple `update()` stages. Thus, you can certainly break the training down into multiple sessions.

If you train your model with `subsample` < 1.0, then each tree is already trained on slightly different data, so I don't see how incremental training would be that much different.

Do keep an eye on the distribution of the features, though. If the distribution changes from time to time between your incremental updates, then I think it may not end up working very well, even though GBT is not sensitive to scaling.
It seems like something interesting and worth exploring, but I don't have the time to implement it in the short term.
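A hedged sketch of breaking training into multiple sessions by persisting the booster between them, again assuming the single-machine xgboost4j-scala API; the parameter re-application on the loaded booster is an assumption (parameters may not fully survive serialization in every version), and all paths are placeholders:

```scala
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}

val params = Map("eta" -> 0.1, "max_depth" -> 6,
  "objective" -> "binary:logistic")

// Session 1: train the first 50 iterations and persist the model.
val week1 = new DMatrix("data/week1.libsvm")
val booster = XGBoost.train(week1, params, 50)
booster.saveModel("model/week1.bin")

// Session 2, some time later: reload the model and keep boosting
// on the new data.
val resumed = XGBoost.loadModel("model/week1.bin")
// Assumption: training parameters are re-applied on the loaded
// booster, since they may not all be restored from the binary model.
params.foreach { case (k, v) => resumed.setParam(k, v.toString) }
val week2 = new DMatrix("data/week2.libsvm")
for (iter <- 50 until 100) resumed.update(week2, iter)
resumed.saveModel("model/week2.bin")
```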
fc1plusx commented on Dec 9, 2016
Thanks a lot for your feedback!
I was aware of the feature distribution problem, and I don't think it is something you can easily solve without allowing for a complete model reorganisation. Nevertheless, this problem is shared by many incremental models, and I believe it is an issue you somehow need to learn to live with.
I would be really interested in seeing a (perhaps preliminary) incremental training method in XGBoost Spark (as you said, maybe similar to what's already available in the Python interface). In the meantime, I'll check what I can do on my side :-)
CodingCat commented on Dec 10, 2016
Instead of providing the interface shown in the Stack Overflow thread, perhaps #1670 is a better solution?