
Is it possible to update a model with new data without retraining the model from scratch? #3055

Closed
@antontarasenko

Description


I need to update a model with new data without retraining the model from scratch. That is, incremental training for cases when not all the data is available right away.

This problem is similar to the "can't fit data in memory" problem, which was raised before in #56, #163, #244. But that was 2-3 years ago, and I see some changes in the available parameters process_type and updater. The FAQ suggests using external memory via cacheprefix, but this assumes I have all the data ready.

The solution in #1686 makes several passes over the entire dataset.

Another related issue is #2970, in particular #2970 (comment). I tried 'process_type': 'update', but it throws the initial error mentioned in that issue. Without it, the model gives inconsistent results.

I tried various combinations of parameters for train in Python, but train keeps building the model from scratch (or doing something else entirely). Here are the examples.

In a nutshell, this is what works (sometimes) and needs feedback from more experienced members of the community:

import xgboost as xgb
from sklearn.metrics import mean_squared_error

# train, train_1, train_2, and test are xgb.DMatrix objects;
# y_test holds the labels for the test set.

print('Full')
bst_full = xgb.train(dtrain=train, params=params)
print(mean_squared_error(y_true=y_test, y_pred=bst_full.predict(test)))

print('Subset 1')
bst_1 = xgb.train(dtrain=train_1, params=params)
print(mean_squared_error(y_true=y_test, y_pred=bst_1.predict(test)))

print('Subset 2')
bst_2 = xgb.train(dtrain=train_2, params=params)
print(mean_squared_error(y_true=y_test, y_pred=bst_2.predict(test)))

print('Subset 1 updated with subset 2')
bst_1u2 = xgb.train(dtrain=train_1, params=params)
bst_1u2 = xgb.train(dtrain=train_2, params=params, xgb_model=bst_1u2)
print(mean_squared_error(y_true=y_test, y_pred=bst_1u2.predict(test)))

Here I'm looking to minimize the difference between the first and the fourth models, but it keeps jumping up and down, even when I equalize the total number of boosting rounds in both methods.

Is there a canonical way to update models with newly arriving data alone?

Environment

  • Python: 3.6
  • xgboost: 0.7.post3

Similar issues

Contributors saying new-data training was impossible at the time of writing:

Activity

hcho3 (Collaborator) commented on Jan 22, 2018

In #2495, I said incremental training was "impossible". A little clarification is in order.

  • As Tianqi pointed out in Incremental Loads #56, tree construction algorithms currently depend on the availability of the whole data to choose optimal splits.
  • In addition, the gradient boosting algorithm used in XGBoost was formulated with a batch assumption, i.e. each newly added tree should reduce the training loss over the whole training data.
  • The "training continuation" feature (with xgb_model) thus does not do what many would think it does. One gets undefined behavior when xgb.train is asked to train further on a dataset different from the one used to train the model given in xgb_model. The behavior is "undefined" in the sense that the underlying algorithm makes no guarantee that the loss over (old data) + (new data) would be in any way reduced. Observe that the trees in the existing ensemble had no knowledge of the new incoming data. [EDIT: see @khotilov 's comment below to learn about situations where training continuation with different data would make sense.]
  • One way out of this conundrum is to use the random forest approach: keep the old trees around, fit a new set of trees with new data only, and then combine the old and new trees in a random forest. This is rather unsatisfactory, since you're throwing away the main benefits of boosted trees over random forests (e.g. a more compact model, lower bias, etc.).
  • Another way is to allow the old trees to be modified. The "training continuation" feature does NOT do this. On the other hand, the incremental training example in incremental learning, partial_fit like sklearn? #1686 does modify the old trees in light of new data. The example makes several passes over the data (old and new) to ensure that all trees receive updates that reflect all the data.
  • So for now, your hope appears to lie in the option 'process_type': 'update'. I think it is an experimental feature, so proceed at your own risk. To use the feature, make sure to install the latest XGBoost (0.7.post3). The feature is currently quite limited, in that you are not allowed to modify the tree structure; only leaf values will be updated.

Hope it helps!

hcho3 (Collaborator) commented on Jan 22, 2018

@antontarasenko Actually, I'm curious about the whole quest behind "incremental learning": what it means and why it is sought after. Can we schedule a brief Google hangout session to discuss?

hcho3 (Collaborator) commented on Jan 22, 2018

A wild guess: using an online algorithm for tree construction may do what you want. See this paper, for instance.
Two limitations:

  • You'd need to assume that your data stream doesn't have any concept drift.
  • The gradient boosting algorithm needs to be reformulated using noisy samples rather than the whole training data. This paper by Friedman does this to some extent, although they do this simply to reduce overfitting.

This paper is interesting too: it presents a way to find good splits without having all the data.

CodingCat (Member) commented on Jan 22, 2018

@Yunni The first item in @hcho3 's reply reminds me of the newly added checkpoint feature in Spark.

We should have something blocking the user from using a different training dataset with this feature, to guarantee correctness.

Yunni (Contributor) commented on Jan 23, 2018

Right. I think we should check boosterType as well. We can put a metadata file which contains boosterType and a checksum of the dataset. Sounds good?

CodingCat (Member) commented on Jan 23, 2018

How do you get the checksum of the dataset? A content hash?

Yunni (Contributor) commented on Jan 23, 2018

Yes. We can simply use an LRC checksum.
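A sketch of what such a content checksum could look like (SHA-256 here as a stand-in for LRC; the metadata file layout is hypothetical, not an existing XGBoost feature):

```python
import hashlib
import json

import numpy as np


def dataset_fingerprint(X, y):
    # Content hash over the raw bytes of features and labels.
    h = hashlib.sha256()
    h.update(np.ascontiguousarray(X).tobytes())
    h.update(np.ascontiguousarray(y).tobytes())
    return h.hexdigest()


def write_metadata(path, booster_type, X, y):
    # Hypothetical metadata file guarding training continuation:
    # on resume, recompute the fingerprint and refuse to continue on mismatch.
    meta = {'boosterType': booster_type, 'checksum': dataset_fingerprint(X, y)}
    with open(path, 'w') as f:
        json.dump(meta, f)
```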

CodingCat (Member) commented on Jan 23, 2018

Isn’t it time-consuming to calculate a hash? Maybe we can simply add reminders in the comments.

hcho3 (Collaborator) commented on Jan 23, 2018

@CodingCat Indeed, at minimum we need to warn the user not to change the dataset for training continuation.

That said, I just found a small warning in the CLI example, which says

Continue from Existing Model
If you want to continue boosting from existing model, say 0002.model, use

../../xgboost mushroom.conf model_in=0002.model num_round=2 model_out=continue.model

xgboost will load from 0002.model and continue boosting for 2 rounds, then save the output to continue.model. However, beware that the training and evaluation data specified in mushroom.conf should not change when you use this function. [Emphasis mine]

Clearly we need to do a better job of making this warning more prominent.

khotilov (Member) commented on Jan 25, 2018

@hcho3 while it's true that in some specific application contexts it makes sense to restrict training continuation to the same data, I wouldn't make it a blanket statement and wouldn't implement any hard restrictions on that. Yes, when you have a large dataset and your goal is to achieve optimal performance on the whole dataset, you won't get it (for the reasons you have described) by incrementally learning with either separate parts of the dataset or with cumulatively increasing data.

However, there are applications where training continuation on new data makes good practical sense. E.g., when you get new data that is related but exhibits some sort of "concept drift", there is often a good chance that by taking an old model learned on old data as "prior knowledge", and adapting it to the new data by training continuation on that new data alone, you would get a better-performing model for future data resembling the new data than by training from scratch on either the new data only or a combined sample of old + new data. Sometimes you don't even have access to the old data anymore, or cannot combine it with your new data (e.g., for legal reasons).

hcho3 (Collaborator) commented on Jan 26, 2018

@khotilov I stand corrected. Calling training continuation "undefined behavior" was a sweeping generalization, if what you have described is true.
I have a question for you: how does training continuation with boosting fare when it comes to handling concept drift? I've read papers where the authors use random forests to handle concept drift, with a sliding window to deprecate old trees. (For an example, see this paper.)

JoshuaC3 commented on Feb 5, 2018

Firstly, there is a paper about using a random forest to initialise your GBM model, which gets better final results than either RF or GBM alone, and in fewer rounds. I cannot find it, however :( This seems like a similar concept, except you are using a different GBM to initialise. I guess the other main difference is that it is on another set of data...

Secondly, sometimes it is more important to train a model quickly. I have been working on some time-series problems where I have been doing transfer learning with LSTMs: I train the base model on generic historical data and then use transfer learning to fine-tune on specific live data. It would take too long to train a full new model on live data, even though ideally I would. I think the same could be true of xgboost, i.e. 95% of model-optimal prediction is better than no prediction.

khotilov (Member) commented on Feb 9, 2018

@hcho3 While my mention of "concept drift" was in a broad sense, boosting continuation would likely do better with "concept shifts" (when new data has some differences but is expected to remain stable). Picking up slow continuous data "drifts" would be harder. But for strongly trending drifts, even that random forest method might not work well, and some forecasting elements would have to be utilized. A lot would depend on the situation.

Also, a weak spot of boosted tree learners is that they are greedy feature-by-feature partitioners, so they might not pick up well on the kinds of changes where the univariate effects are not so significant, e.g., when only interactions change. It might be rather useful if we could add some sort of limited look-ahead functionality to xgboost. E.g., in https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/kdd13.pdf they have a bivariate scan step that I think might work well as a spin on the histogram algorithm.

As for why the predictive performance on future data similar to the "new data" is sometimes worse for a model trained on a combined "old data" + "new data" dataset than for a training continuation on "new data": the former model is optimized over the whole combined dataset, and that can happen at the expense of the "new data" when that data is somewhat different and relatively small.

liujxing commented on Apr 12, 2018

I thought incremental training with minibatches of data (just like SGD) would be kind of equivalent to subsampling the rows at each iteration. Is the subsampling in XGBoost performed only once for the whole training lifecycle, or once every iteration?

benyaminelc90 commented on Apr 21, 2018

I also need to use incremental learning. I've read all the links mentioned above; however, I'm confused.
In the end, is there any version of XGBoost that can retrain a trained xgb model on a newly received data point or batch of data?
I've found the links below, which addressed this issue before the date of this post. Don't they work? Can't we do incremental learning with them? What's the problem with them?

#1686
#484
https://stackoverflow.com/questions/38079853/how-can-i-implement-incremental-training-for-xgboost?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
#2495
