Description
I need to update a model with new data without retraining it from scratch. That is, incremental training for cases when not all the data is available right away.
This problem is similar to the "can't fit data in memory" problem, which was raised before in #56, #163, #244. But that was 2-3 years ago, and I see some changes in the available parameters `process_type` and `updater`. The FAQ suggests using external memory via `cacheprefix`. But this assumes I have all the data ready.
The solution in #1686 makes several passes over the entire dataset. Another related issue is #2970, in particular #2970 (comment). I tried `'process_type': 'update'`, but it throws the initial error mentioned in that issue. Without it, the model gives inconsistent results.
I tried various combinations of parameters for `train` in Python, but `train` keeps building the model from scratch (or doing something else entirely). Here are the examples.
In a nutshell, this is what works (sometimes) and needs feedback from more experienced members of the community:
```python
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# Model trained on the full dataset
print('Full')
bst_full = xgb.train(dtrain=train, params=params)
print(mean_squared_error(y_true=y_test, y_pred=bst_full.predict(test)))

# Model trained on the first subset only
print('Subset 1')
bst_1 = xgb.train(dtrain=train_1, params=params)
print(mean_squared_error(y_true=y_test, y_pred=bst_1.predict(test)))

# Model trained on the second subset only
print('Subset 2')
bst_2 = xgb.train(dtrain=train_2, params=params)
print(mean_squared_error(y_true=y_test, y_pred=bst_2.predict(test)))

# Model trained on subset 1, then continued on subset 2 via xgb_model
print('Subset 1 updated with subset 2')
bst_1u2 = xgb.train(dtrain=train_1, params=params)
bst_1u2 = xgb.train(dtrain=train_2, params=params, xgb_model=bst_1u2)
print(mean_squared_error(y_true=y_test, y_pred=bst_1u2.predict(test)))
```
Here I'm looking to minimize the difference between the first and the fourth models. But it keeps jumping up and down, even when the total number of boosting rounds is the same in both approaches (see the sketch below).
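For reference, this is how I equalized the totals (a sketch; the round counts 10/20 are illustrative, and `params`, `train`, `train_1`, `train_2` are as above):

```python
# Sketch: equalizing total boosting rounds between the two setups.
bst_full = xgb.train(params, dtrain=train, num_boost_round=20)

bst_incr = xgb.train(params, dtrain=train_1, num_boost_round=10)
bst_incr = xgb.train(params, dtrain=train_2, num_boost_round=10,
                     xgb_model=bst_incr)  # 10 + 10 = 20 trees in total
```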
Is there a canonical way to update models with newly arriving data alone?
Environment
- Python: 3.6
- `xgboost`: 0.7.post3
Similar issues
- Continue training in a dynamic training data environment #1225
- incremental learning, partial_fit like sklearn? #1686
- [xgboost4j-spark] Incremental training #1859
- Retrain Xgboost Model #2495
- "xgb_model" parameter in train() doesn't increment learning correctly #2707
Contributors have said that new-data training was impossible at the time of writing; see the activity below.
Activity
hcho3 commented on Jan 22, 2018
In #2495, I said incremental training was "impossible". A little clarification is in order.
1. Training continuation (via the `xgb_model` parameter) does not do what many would think it does. One gets undefined behavior when `xgb.train` is asked to train further on a dataset different from the one used to train the model given in `xgb_model`. The behavior is "undefined" in the sense that the underlying algorithm makes no guarantee that the loss over (old data) + (new data) would be in any way reduced. Observe that the trees in the existing ensemble had no knowledge of the new incoming data. [EDIT: see @khotilov's comment below to learn about situations where training continuation with different data would make sense.]
2. `'process_type': 'update'` is an experimental feature, so proceed at your own risk. To use the feature, make sure to install the latest XGBoost (`0.7.post3`). The feature is currently quite limited, in that you are not allowed to modify the tree structure; only leaf values will be updated (a sketch follows below).

Hope it helps!
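For concreteness, here is a minimal sketch of the leaf-refresh mode described in item 2, assuming two DMatrix batches `dtrain_v1` and `dtrain_v2` (the names are illustrative):

```python
import xgboost as xgb

# Train an initial model on the first batch.
params = {'objective': 'reg:linear', 'max_depth': 3}
bst = xgb.train(params, dtrain_v1, num_boost_round=10)

# Re-fit only the leaf values of the existing 10 trees on the second batch.
# Tree structures are frozen; the 'refresh' updater recomputes node statistics
# and leaf values against the new data.
update_params = dict(params,
                     process_type='update',
                     updater='refresh',
                     refresh_leaf=True)
bst_refreshed = xgb.train(update_params, dtrain_v2,
                          num_boost_round=10,  # must not exceed the trees in bst
                          xgb_model=bst)
```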
hcho3 commented on Jan 22, 2018
@antontarasenko Actually, I'm curious about the whole quest behind "incremental learning": what it means and why it is sought after. Can we schedule a brief Google hangout session to discuss?
hcho3 commented on Jan 22, 2018
A vain guess: using an online algorithm for tree construction may do what you want. See this paper for instance.
Two limitations:
This paper is interesting too: it presents a way to find good splits without having all the data.
CodingCat commented on Jan 22, 2018
@Yunni The first item in @hcho3's reply reminds me of something regarding the newly added checkpoint feature in Spark.
We should have something blocking the user from using a different training dataset with this feature, to guarantee correctness.
Yunni commented on Jan 23, 2018
Right. I think we should check `boosterType` as well. We can add a metadata file which contains `boosterType` and a checksum of the dataset. Sounds good?
CodingCat commented on Jan 23, 2018
How would you get the checksum of the dataset? A content hash?
Yunni commented on Jan 23, 2018
Yes. We can simply use an LRC checksum.
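For reference, an LRC (longitudinal redundancy check) is just a byte-wise XOR over the data; a minimal sketch in Python (the function name is illustrative):

```python
def lrc_checksum(path, chunk_size=1 << 20):
    """Byte-wise XOR over a file (longitudinal redundancy check)."""
    acc = 0
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            for byte in chunk:  # iterating bytes yields ints in Python 3
                acc ^= byte
    return acc
```

Note that it still requires a full pass over the data, which relates to the cost concern raised below.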
CodingCat commented on Jan 23, 2018
Isn't it time-consuming to calculate a hash? Maybe we can simply add reminders in the comments.
hcho3 commented on Jan 23, 2018
@CodingCat Indeed, at minimum we need to warn the user not to change the dataset for training continuation.
That said, I just found a small warning in the CLI example, which says
Clearly we need to do a better job of making this warning more prominent.
khotilov commented on Jan 25, 2018
@hcho3 While it's true that in some specific application contexts it makes sense to restrict training continuation to the same data, I wouldn't make it a blanket statement, and I wouldn't implement any hard restrictions on that. Yes, when you have a large dataset and your goal is to achieve optimal performance on the whole dataset, you wouldn't get there (for the reasons you have described) by incrementally learning with either separate parts of the dataset or with cumulatively increasing data.
However, there are applications where training continuation on new data makes good practical sense. E.g., when you get new data that is related but has some sort of "concept drift", there is sometimes a good chance that by taking an old model learned on old data as "prior knowledge", and adapting it to the new data by training continuation on that new data, you would get a model that performs better on future data resembling the new data than one trained from scratch, whether on the new data alone or on a combined sample of old + new data. Sometimes you don't even have access to the old data anymore, or cannot combine it with your new data (e.g., for legal reasons).
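As an illustration of this "prior knowledge" idea, a minimal sketch (the DMatrix names, round counts, and the lowered `eta` for adaptation are assumptions for illustration, not a prescription):

```python
import xgboost as xgb

# Base model fitted on the old data.
params = {'objective': 'reg:linear', 'eta': 0.1, 'max_depth': 4}
bst_old = xgb.train(params, dtrain_old, num_boost_round=200)

# Adapt to new data: continue boosting from the old ensemble, with a smaller
# eta so the new trees correct the old model rather than dominate it.
adapt_params = dict(params, eta=0.02)
bst_adapted = xgb.train(adapt_params, dtrain_new,
                        num_boost_round=50, xgb_model=bst_old)
```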
hcho3 commented on Jan 26, 2018
@khotilov I stand corrected. Calling training continuation "undefined behavior" was a sweeping generalization, if what you have described is true.
I have a question for you: how does training continuation with boosting fare when it comes to handling concept drift? I read papers where the authors use random forests to handle concept drift, with a sliding window to deprecate old trees. (For an example, see this paper.)
JoshuaC3 commented on Feb 5, 2018
Firstly, there is a paper about using a random forest to initialise your GBM model, giving better final results than either RF or GBM alone, and in fewer rounds. I cannot find it, however :( This seems like a similar concept, except you are using a different GBM to initialise. I guess the other main difference is that it is on another set of data...
Secondly, sometimes it is more important to train a model quickly. I have been working on some time series problems where I have been doing transfer learning with LSTMs. I train the base model on generic historical data and then use transfer learning to fine-tune on specific live data. It would take too long to train a full new model on live data, even though ideally I would. I think the same could be true of xgboost, i.e. 95% of the model's optimal prediction quality is better than no prediction.
khotilov commented on Feb 9, 2018
@hcho3 While my mention of "concept drift" was in a broad sense, boosting continuation would likely do better with "concept shifts" (when new data has some differences but is expected to remain stable). Picking up slow continuous data "drifts" would be harder. But for strong trending drifts, even that random forest method might not work well, and some forecasting elements would have to be utilized. A lot would depend on the situation.
Also, a weak spot of boosted tree learners is that they are greedy feature-by-feature partitioners, so they might not pick up well on the kinds of changes where the univariate effects are not significant, e.g., when only interactions change. It might be rather useful if we could add some sort of limited look-ahead functionality to xgboost. E.g., in https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/kdd13.pdf they have a bivariate scan step that I think might work well as a spin on the histogram algorithm.
As for why the predictive performance on future data similar to the "new data" is sometimes worse for a model trained on a combined "old data" + "new data" dataset, compared to training continuation on the "new data": the former model is optimized over the whole combined dataset, and that may happen at the expense of the "new data" when that "new data" is somewhat different and relatively small.
liujxing commented on Apr 12, 2018
I thought incremental training with minibatches of data (just like SGD) is kind of equivalent to subsampling the rows at each iteration. Is the subsampling in XGBoost performed only once for the whole training lifecycle, or once every iteration?
benyaminelc90 commented on Apr 21, 2018
I also need to use incremental learning. I've read all the links mentioned above; however, I'm still confused.
So, is there any version of XGBoost that can retrain a trained xgb model on newly received data points or batches of data?
I've found the links below, which addressed this issue before the date of this post. Don't they work? Can't we do incremental learning with them? What's the problem with them?
#1686
#484
https://stackoverflow.com/questions/38079853/how-can-i-implement-incremental-training-for-xgboost
#2495