Description
In this SO thread, http://stackoverflow.com/questions/38079853/how-can-i-implement-incremental-training-for-xgboost, I could find some info about incremental training using the Python interface, but I was not able to find any info about such a feature in the Spark version. Is it available? Will it be in the future? Many thanks!
xydrolase commented on Dec 9, 2016
Internally there is an `update` function in `ml.dmlc.xgboost4j.scala.Booster`, which allows you to perform updates iteration by iteration, but there is currently no wrapper for incremental updates in xgboost4j-spark.

Are you envisioning a Spark Streaming application? If you're concerned about memory, you can try `useExternalMemory = true` when you train with Spark's XGBoost.
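For reference, a minimal single-machine sketch of that per-iteration `update` path, assuming the xgboost4j-scala API of that era (`DMatrix`, `XGBoost.train`, `Booster.update`); the file paths, parameter values, and the 50/50 iteration split are illustrative assumptions, not part of the original comment:

```scala
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}

// Two hypothetical LibSVM files standing in for successive data batches.
val batch1 = new DMatrix("data/batch1.libsvm")
val batch2 = new DMatrix("data/batch2.libsvm")

val params = Map(
  "eta" -> 0.1,
  "max_depth" -> 6,
  "objective" -> "binary:logistic"
)

// Train the first 50 trees on the first batch in one call, as usual.
val booster = XGBoost.train(batch1, params, 50)

// Each update() call grows the existing model by one boosting
// iteration; here the second batch drives another 50 rounds.
for (iter <- 50 until 100) {
  booster.update(batch2, iter)
}

booster.saveModel("model/incremental.bin")
```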
fc1plusx commented on Dec 9, 2016

No, a standard Spark application. But it would be better for my experimentation to train the classifier incrementally with the new data that I get from time to time. Any plans for this feature in the future? Thanks!
xydrolase commented on Dec 9, 2016
From a technical point of view, it is certainly doable, perhaps with an interface similar to the Python one. But could you elaborate a bit more on the use case?
So you have, say, 50 million samples today, and you train a model with 50 trees.
A week later, you have another 50 million samples, and you want to update the model for another 50 iterations?
fc1plusx commented on Dec 9, 2016
Exactly. To elaborate: I would like to update the previous model with the new data that I get. I think it would be interesting to see how the model performs when trained incrementally versus re-trained on the full updated dataset (i.e. train with data1 and update with data2, versus train with data1 and then train again with data1 + data2). Also, from a computational point of view, incremental training should be less expensive than re-training on the updated dataset.
I'm not aware of the XGBoost internals, but if I understand correctly, the GBT algorithm is inherently incremental, yet there are quite a lot of optimisations in XGBoost that require all the data, making it something of a hybrid incremental/batch classification and regression framework. Am I right? In that case, is incremental training still possible?
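To make the proposed comparison concrete, here is a rough sketch under the same xgboost4j-scala assumptions as the earlier example; all paths and parameter values are placeholders, labels are assumed binary, and `errorRate` is a hypothetical helper, not a library function:

```scala
import ml.dmlc.xgboost4j.scala.{Booster, DMatrix, XGBoost}

val data1   = new DMatrix("data/data1.libsvm")        // first batch
val data2   = new DMatrix("data/data2.libsvm")        // new batch
val data12  = new DMatrix("data/data1_plus_2.libsvm") // both combined
val holdout = new DMatrix("data/holdout.libsvm")      // fixed test set

val params = Map("eta" -> 0.1, "max_depth" -> 6,
  "objective" -> "binary:logistic")

// Model A: train on data1, then continue boosting on data2 only.
val incremental = XGBoost.train(data1, params, 50)
for (iter <- 50 until 100) incremental.update(data2, iter)

// Model B: retrain from scratch on the full updated dataset.
val retrained = XGBoost.train(data12, params, 100)

// Hypothetical helper: 0/1 error on a binary-labeled holdout.
def errorRate(b: Booster, d: DMatrix): Double = {
  val preds  = b.predict(d).map(_(0))
  val labels = d.getLabel
  val wrong  = preds.zip(labels).count { case (p, y) =>
    (if (p > 0.5f) 1.0f else 0.0f) != y
  }
  wrong.toDouble / labels.length
}

println(s"incremental: ${errorRate(incremental, holdout)}")
println(s"retrained:   ${errorRate(retrained, holdout)}")
```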
xydrolase commented on Dec 9, 2016
As I mentioned earlier, internally every iteration is trained by the `update()` function, so training an entire model can be decomposed into multiple `update()` stages. Thus, you can certainly break the training down into multiple sessions.

If you train your model with `subsample` < 1.0, then each tree is already trained on slightly different data, so I don't see how incremental training would be that much different.

Do keep an eye on the distribution of the features, though. If the distribution changes from time to time between your incremental updates, then I think it may not end up working very well, even though GBT is not sensitive to scaling.
It seems like something interesting and worth exploring, but I don't have the time to implement it in the short term.
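A hedged sketch of breaking training into multiple sessions by persisting the booster between them, again assuming the single-machine xgboost4j-scala API; the parameter re-application on the loaded booster is an assumption (parameters may not fully survive serialization in every version), and all paths are placeholders:

```scala
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}

val params = Map("eta" -> 0.1, "max_depth" -> 6,
  "objective" -> "binary:logistic")

// Session 1: train the first 50 iterations and persist the model.
val week1 = new DMatrix("data/week1.libsvm")
val booster = XGBoost.train(week1, params, 50)
booster.saveModel("model/week1.bin")

// Session 2, some time later: reload the model and keep boosting
// on the new data.
val resumed = XGBoost.loadModel("model/week1.bin")
// Assumption: training parameters are re-applied on the loaded
// booster, since they may not all be restored from the binary model.
params.foreach { case (k, v) => resumed.setParam(k, v.toString) }
val week2 = new DMatrix("data/week2.libsvm")
for (iter <- 50 until 100) resumed.update(week2, iter)
resumed.saveModel("model/week2.bin")
```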
fc1plusx commented on Dec 9, 2016
Thanks a lot for your feedback!
I was aware of the feature distribution problem, and I don't think it is something you can easily solve without allowing for a complete model reorganisation. Nevertheless, this problem is shared by many incremental models, and I believe it is an issue you somehow need to learn to live with.
I would be really interested in seeing a (perhaps preliminary) incremental training method in XGBoost Spark (as you said, maybe similar to what's already available in the Python interface). In the meantime, I'll check what I can do on my side :-)
CodingCat commented on Dec 10, 2016
Instead of providing the interface shown in the Stack Overflow thread, perhaps #1670 is a better solution?