Description
I need to update a model with new data without retraining it from scratch. That is, incremental training for cases when not all the data is available right away.
This problem is similar to the "can't fit data in memory" problem, which was raised before in #56, #163, #244. But that was 2-3 years ago, and I see some changes in the available parameters `process_type` and `updater`. The FAQ suggests using external memory via `cacheprefix`. But this assumes I have all the data ready.
The solution in #1686 makes several passes over the entire dataset. Another related issue is #2970, in particular #2970 (comment). I tried `'process_type': 'update'`, but it throws the initial error mentioned in that issue. Without it, the model gives inconsistent results.
I tried various combinations of parameters for `train` in Python, but `train` keeps building the model from scratch (or doing something else entirely). Here are the examples.
In a nutshell, this is what works (sometimes) and needs feedback from more experienced members of the community:
```python
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# Model trained on the full dataset
print('Full')
bst_full = xgb.train(dtrain=train, params=params)
print(mean_squared_error(y_true=y_test, y_pred=bst_full.predict(test)))

# Model trained on the first subset only
print('Subset 1')
bst_1 = xgb.train(dtrain=train_1, params=params)
print(mean_squared_error(y_true=y_test, y_pred=bst_1.predict(test)))

# Model trained on the second subset only
print('Subset 2')
bst_2 = xgb.train(dtrain=train_2, params=params)
print(mean_squared_error(y_true=y_test, y_pred=bst_2.predict(test)))

# Model trained on subset 1, then continued on subset 2 via xgb_model
print('Subset 1 updated with subset 2')
bst_1u2 = xgb.train(dtrain=train_1, params=params)
bst_1u2 = xgb.train(dtrain=train_2, params=params, xgb_model=bst_1u2)
print(mean_squared_error(y_true=y_test, y_pred=bst_1u2.predict(test)))
```
Here I'm looking to minimize the difference between the first and the fourth models. But it keeps jumping up and down, even when the total number of boosting rounds is the same in both approaches (see the sketch below).
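For reference, this is how I equalized the totals (a sketch; the round counts 10/20 are illustrative, and `params`, `train`, `train_1`, `train_2` are as above):

```python
# Sketch: equalizing total boosting rounds between the two setups.
bst_full = xgb.train(params, dtrain=train, num_boost_round=20)

bst_incr = xgb.train(params, dtrain=train_1, num_boost_round=10)
bst_incr = xgb.train(params, dtrain=train_2, num_boost_round=10,
                     xgb_model=bst_incr)  # 10 + 10 = 20 trees in total
```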
Is there a canonical way to update models with newly arriving data alone?
Environment
- Python: 3.6
- `xgboost`: 0.7.post3
Similar issues
- Continue training in a dynamic training data environment #1225
- incremental learning, partial_fit like sklearn? #1686
- [xgboost4j-spark] Incremental training #1859
- Retrain Xgboost Model #2495
- "xgb_model" parameter in train() doesn't increment learning correctly #2707
Contributors have said that new-data training was impossible at the time of writing; see the activity below.
Activity
hcho3 commented on Jan 22, 2018
In #2495, I said incremental training was "impossible". A little clarification is in order.
1. Training continuation (via the `xgb_model` parameter) does not do what many would think it does. One gets undefined behavior when `xgb.train` is asked to train further on a dataset different from the one used to train the model given in `xgb_model`. The behavior is "undefined" in the sense that the underlying algorithm makes no guarantee that the loss over (old data) + (new data) would be in any way reduced. Observe that the trees in the existing ensemble had no knowledge of the new incoming data. [EDIT: see @khotilov's comment below to learn about situations where training continuation with different data would make sense.]
2. `'process_type': 'update'` is an experimental feature, so proceed at your own risk. To use the feature, make sure to install the latest XGBoost (`0.7.post3`). The feature is currently quite limited, in that you are not allowed to modify the tree structure; only leaf values will be updated (a sketch follows below).

Hope it helps!
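For concreteness, here is a minimal sketch of the leaf-refresh mode described in item 2, assuming two DMatrix batches `dtrain_v1` and `dtrain_v2` (the names are illustrative):

```python
import xgboost as xgb

# Train an initial model on the first batch.
params = {'objective': 'reg:linear', 'max_depth': 3}
bst = xgb.train(params, dtrain_v1, num_boost_round=10)

# Re-fit only the leaf values of the existing 10 trees on the second batch.
# Tree structures are frozen; the 'refresh' updater recomputes node statistics
# and leaf values against the new data.
update_params = dict(params,
                     process_type='update',
                     updater='refresh',
                     refresh_leaf=True)
bst_refreshed = xgb.train(update_params, dtrain_v2,
                          num_boost_round=10,  # must not exceed the trees in bst
                          xgb_model=bst)
```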
hcho3 commented on Jan 22, 2018
@antontarasenko Actually, I'm curious about the whole quest behind "incremental learning": what it means and why it is sought after. Can we schedule a brief Google hangout session to discuss?
hcho3 commented on Jan 22, 2018
A vain guess: using an online algorithm for tree construction may do what you want. See this paper for instance.
Two limitations:
This paper is interesting too: it presents a way to find good splits without having all the data.
CodingCat commented on Jan 22, 2018
@Yunni The first item in @hcho3's reply reminds me of something regarding the newly added checkpoint feature in Spark.
We should have something blocking the user from using a different training dataset with this feature, to guarantee correctness.
Yunni commented on Jan 23, 2018
Right. I think we should check `boosterType` as well. We can add a metadata file which contains `boosterType` and a checksum of the dataset. Sounds good?
CodingCat commented on Jan 23, 2018
How would you get the checksum of the dataset? A content hash?
Yunni commented on Jan 23, 2018
Yes. We can simply use an LRC checksum.
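For reference, an LRC (longitudinal redundancy check) is just a byte-wise XOR over the data; a minimal sketch in Python (the function name is illustrative):

```python
def lrc_checksum(path, chunk_size=1 << 20):
    """Byte-wise XOR over a file (longitudinal redundancy check)."""
    acc = 0
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            for byte in chunk:  # iterating bytes yields ints in Python 3
                acc ^= byte
    return acc
```

Note that it still requires a full pass over the data, which relates to the cost concern raised below.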
CodingCat commented on Jan 23, 2018
Isn't it time-consuming to calculate a hash? Maybe we can simply add reminders in the comments.
hcho3 commented on Jan 23, 2018
@CodingCat Indeed, at minimum we need to warn the user not to change the dataset for training continuation.
That said, I just found a small warning in the CLI example, which says
Clearly we need to do a better job of making this warning more prominent.
khotilov commented on Jan 25, 2018
@hcho3 While it's true that in some specific application contexts it makes sense to restrict training continuation to the same data, I wouldn't make it a blanket statement, and I wouldn't implement any hard restrictions on that. Yes, when you have a large dataset and your goal is to achieve optimal performance on the whole dataset, you wouldn't get there (for the reasons you have described) by incrementally learning with either separate parts of the dataset or with cumulatively increasing data.
However, there are applications where training continuation on new data makes good practical sense. E.g., when you get new data that is related but has some sort of "concept drift", there is sometimes a good chance that by taking an old model learned on old data as "prior knowledge", and adapting it to the new data by training continuation on that new data, you would get a model that performs better on future data resembling the new data than one trained from scratch, whether on the new data alone or on a combined sample of old + new data. Sometimes you don't even have access to the old data anymore, or cannot combine it with your new data (e.g., for legal reasons).
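As an illustration of this "prior knowledge" idea, a minimal sketch (the DMatrix names, round counts, and the lowered `eta` for adaptation are assumptions for illustration, not a prescription):

```python
import xgboost as xgb

# Base model fitted on the old data.
params = {'objective': 'reg:linear', 'eta': 0.1, 'max_depth': 4}
bst_old = xgb.train(params, dtrain_old, num_boost_round=200)

# Adapt to new data: continue boosting from the old ensemble, with a smaller
# eta so the new trees correct the old model rather than dominate it.
adapt_params = dict(params, eta=0.02)
bst_adapted = xgb.train(adapt_params, dtrain_new,
                        num_boost_round=50, xgb_model=bst_old)
```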
hcho3 commented on Jan 26, 2018
@khotilov I stand corrected. Calling training continuation "undefined behavior" was a sweeping generalization, if what you have described is true.
I have a question for you: how does training continuation with boosting fare when it comes to handling concept drift? I read papers where the authors use random forests to handle concept drift, with a sliding window to deprecate old trees. (For an example, see this paper.)
JoshuaC3 commented on Feb 5, 2018
Firstly, there is a paper about using a random forest to initialise your GBM model, giving better final results than either RF or GBM alone, and in fewer rounds. I cannot find it, however :( This seems like a similar concept, except you are using a different GBM to initialise. I guess the other main difference is that it is on another set of data...
Secondly, sometimes it is more important to train a model quickly. I have been working on some time series problems where I have been doing transfer learning with LSTMs. I train the base model on generic historical data and then use transfer learning to fine-tune on specific live data. It would take too long to train a full new model on live data, even though ideally I would. I think the same could be true of xgboost, i.e. 95% of the model's optimal prediction quality is better than no prediction.
khotilov commented on Feb 9, 2018
@hcho3 While my mention of "concept drift" was in a broad sense, boosting continuation would likely do better with "concept shifts" (when new data has some differences but is expected to remain stable). Picking up slow continuous data "drifts" would be harder. But for strong trending drifts, even that random forest method might not work well, and some forecasting elements would have to be utilized. A lot would depend on the situation.
Also, a weak spot of boosted tree learners is that they are greedy feature-by-feature partitioners, so they might not pick up well on the kinds of changes where the univariate effects are not significant, e.g., when only interactions change. It might be rather useful if we could add some sort of limited look-ahead functionality to xgboost. E.g., in https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/kdd13.pdf they have a bivariate scan step that I think might work well as a spin on the histogram algorithm.
As for why the predictive performance on future data similar to the "new data" is sometimes worse for a model trained on a combined "old data" + "new data" dataset, compared to training continuation on the "new data": the former model is optimized over the whole combined dataset, and that may happen at the expense of the "new data" when that "new data" is somewhat different and relatively small.
liujxing commented on Apr 12, 2018
I thought incremental training with minibatches of data (just like SGD) is kind of equivalent to subsampling the rows at each iteration. Is the subsampling in XGBoost performed only once for the whole training lifecycle, or once every iteration?
benyaminelc90 commented on Apr 21, 2018
I also need to use incremental learning. I've read all the links mentioned above; however, I'm still confused.
So, is there any version of XGBoost that can retrain a trained xgb model on newly received data points or batches of data?
I've found the links below, which addressed this issue before the date of this post. Don't they work? Can't we do incremental learning with them? What's the problem with them?
#1686
#484
https://stackoverflow.com/questions/38079853/how-can-i-implement-incremental-training-for-xgboost
#2495