
Error in catboost.from_matrix(as.matrix(float_and_cat_features_data), : Unsupported label type, expecting double or integer, got character #1874

rtedesco1197 opened this issue Oct 4, 2021 · 10 comments

@rtedesco1197

Problem: Error in catboost.from_matrix(as.matrix(float_and_cat_features_data), : Unsupported label type, expecting double or integer, got character

catboost version: 1.0.0
Operating System: Windows 10
CPU: Ryzen 5?

GPU: Nvidia GTX 1660

Hello, I am fitting a simple model with 2 numerical predictors and a binomial outcome.

When I try to tune_grid this model in R with tidymodels, I get this error even though there are no categorical predictors in my dataset:

Error in catboost.from_matrix(as.matrix(float_and_cat_features_data), : Unsupported label type, expecting double or integer, got character

@Glemhel
Contributor

Glemhel commented Oct 4, 2021

Hello, @rtedesco1197!
As the error message states, the problem is with the label (target variable), not the predictors.

Given this message, I assume your label is either a factor or a character column. Unfortunately, CatBoost in R supports neither of those yet. You can work around this by manually converting your label to integer: as.integer(label) if your label is a factor, and as.integer(as.factor(label)) if your label is a character vector (like c("1", "0", "1", "0")).
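For example, a minimal sketch of both conversions (variable names are illustrative):

```r
# factor label: use its underlying integer codes
y_factor <- factor(c("yes", "no", "yes"))
label <- as.integer(y_factor)

# character label: go through factor first
y_char <- c("1", "0", "1", "0")
label <- as.integer(as.factor(y_char))
```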

If your problem goes beyond this, could you please provide a reproducible example for further investigation?

@rtedesco1197
Author

rtedesco1197 commented Oct 4, 2021

Thanks for the advice @Glemhel; however, when I pass an integer vector like c(0, 1, 1, 0, 0) as the outcome for classification, I get:
Error: For a classification model, the outcome should be a factor.

When I take away the classification mode definition I get:
Error: For probability predictions, the object should be a classification model.

My data is:
train <- data.frame(x1=rnorm(21), x2=c(1:21), y=rbinom(21,1,.8))

@rtedesco1197
Author

cb_spec <-
  boost_tree(
    trees = 1000,
    tree_depth = tune(),
    min_n = tune(),
    loss_reduction = tune(),
    sample_size = tune(),
    mtry = tune(),
    learn_rate = tune()
  ) %>%
  set_engine("catboost", loss_function = 'Logloss') %>%
  set_mode("classification")



cb_rec <- recipe(y ~ ., data = train)



cb_wf <- workflow() %>% 
  add_model(cb_spec) %>% 
  add_recipe(cb_rec)


cb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  mtry(range=c(1L, 2L)),
  learn_rate(),
  size = 10
)

cb_tune <- tune_grid(
  object = cb_wf,
  resamples = cb_folds,
  grid = cb_grid,
  metrics = metric_set(roc_auc),
  control = control_grid(save_pred = TRUE)
)
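The snippet above references cb_folds without defining it. A minimal definition, assuming standard rsample cross-validation on the same train data (the object name and v = 5 are assumptions, not from the original post):

```r
# Assumed resampling object for the tune_grid() call above;
# cb_folds is not defined in the original snippet.
library(rsample)
set.seed(123)
cb_folds <- vfold_cv(train, v = 5)
```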

@Glemhel
Contributor

Glemhel commented Oct 6, 2021

Hello again, @rtedesco1197!
I reproduced your issue, thank you for the code!

The problem is that some of the libraries you use (tune, workflows; I am not sure which one does the actual training) require the target to be a factor for classification mode. Without this mode, as you mentioned, classification-specific functions (those concerning class probabilities) are unavailable ("For probability predictions, the object should be a classification model").

Unfortunately, CatBoost in R cannot handle a factor target yet, which is why you get the initial error message about an incorrect label type (the factor is converted to character on the way in, inside catboost.R). That means there is no simple fix for your problem yet. But supporting a factor target is indeed a useful feature, thank you for raising this!

However, I discovered that for this workflow to work, one has to install the treesnip library; am I right that you are using it?
It basically provides a catboost interface for parsnip, passing data and calling catboost functions.
In its code there is a place where the actual label is passed to catboost. Before that call, one can convert y from factor to integer, as catboost requires. But this could cause unexpected problems when predicting labels: one has to convert the prediction back from integer to the original factor value, which may be tricky.
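A rough sketch of that round-trip conversion (illustrative code, not the actual treesnip internals; the variable names are assumptions):

```r
# Before calling catboost: remember the factor levels and pass integers.
lvls  <- levels(y)
y_int <- as.integer(y) - 1L   # 0-based class indices for catboost

# After prediction: map class indices back to the original factor levels.
pred_factor <- factor(lvls[pred_int + 1L], levels = lvls)
```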

I experimented with this a bit: I forked treesnip and added that conversion. You can install my version via remotes::install_github("Glemhel/treesnip"), and it should work for your case, at least for training. I also hit a 'Target contains only one unique value' error while tuning; this was due to the very small training set (21 samples), so solve it by increasing the number of samples (say, to 1000).

So, to sum up: thank you for reporting the need for factor support in R, and try installing my fork of treesnip as a workaround for your problem! Feel free to ask in case of any difficulties!

@rtedesco1197
Author

rtedesco1197 commented Oct 6, 2021

That is working great, thank you so much for taking the time.

Another problem, however: when I try to utilize my GPU with:
set_engine("catboost", loss_function = 'Logloss', task_type = "GPU")

I get the error:
Error: rsm on GPU is supported for pairwise modes only

However, an rsm parameter is not present in the code I shared with you. Any idea what is going on there? Apologies in advance if this should be a separate issue.

@Glemhel
Contributor

Glemhel commented Oct 6, 2021

It looks like a bug in treesnip's catboost interface: rsm was incorrectly derived by dividing by the number of features in all cases.
I updated my fork; this should work for you now. Try remotes::install_github("Glemhel/treesnip") again.
I will also create a pull request for treesnip to avoid this issue for others as well :)
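For reference, the shape of the fix is roughly this (a sketch under stated assumptions; args, mtry, and x_train are illustrative names, not the actual treesnip code):

```r
# Only derive rsm from mtry when the user actually set mtry;
# previously the division was applied unconditionally.
if (!is.null(mtry)) {
  args$rsm <- mtry / ncol(x_train)   # fraction of features per split
}
```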

@Glemhel
Contributor

Glemhel commented Oct 6, 2021

A better solution is proposed here: curso-r/treesnip#20
However, there is no progress on it yet, as far as I can see. You can use my fork as a workaround or experiment yourself :)

@rtedesco1197
Author

Thanks for all your help :), but now I am getting:

Error: Engine 'catboost' is not supported for boost_tree(). See show_engines('boost_tree')

Sorry for dragging you down the rabbit hole.

@Glemhel
Contributor

Glemhel commented Oct 7, 2021

I only seem to get this error if I do not run library(treesnip); the new update in my fork does not lead to this error. Maybe there is a mixed-installation problem?
I suggest removing treesnip entirely with remove.packages("treesnip") and then running remotes::install_github("Glemhel/treesnip") again.
Hope this helps!
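In full, the clean reinstall looks like:

```r
remove.packages("treesnip")                    # drop any mixed install
remotes::install_github("Glemhel/treesnip")    # reinstall the fork
library(treesnip)                              # register the catboost engine
```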

@zhaoliang0302

zhaoliang0302 commented Nov 6, 2022

> I only seem to get this error if I do not do library(treesnip), new update in my fork does not lead to this error. Maybe there is mixed installation problem? I suggest removing treesnip entirely with remove.packages(treesnip) and then doing remotes::install_github("Glemhel/treesnip") again. Hope this helps!

Same error occurs although I installed your modified treesnip package:

remotes::install_github("Glemhel/treesnip")
library(treesnip)

catboost_model <-
  boost_tree( mode = "classification",
              mtry = tune(), # default [1, ?]
              trees = 1000, # default [1, 2000]
              min_n = 20, # default [2, 40]
              tree_depth = 6, # default [1, 15]
              learn_rate = 0.05, # default [-10, -1]
              engine = "catboost"
  )

catboost_wf <-
  workflow() %>%
  add_model(catboost_model) %>% 
  add_recipe(model_recipe)

catboost_results <-
  catboost_wf %>% 
  tune_grid(resamples = miR_cv,
            grid = 5,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(accuracy,roc_auc)
  )
#Warning message:
#All models failed. Run `show_notes(.Last.tune.result)` for more information.

show_notes(.Last.tune.result)

#unique notes:
#------------------------------------------------------------------------------------------
#Error in `check_spec_mode_engine_val()`:
#! Engine 'catboost' is not supported for `boost_tree()`. See `show_engines('boost_tree')`.

show_engines('boost_tree')
# A tibble: 9 x 2
#   engine   mode          
#   <chr>    <chr>         
# 1 xgboost  classification
# 2 xgboost  regression    
# 3 C5.0     classification
# 4 spark    classification
# 5 spark    regression    
# 6 catboost regression    
# 7 catboost classification
# 8 lightgbm regression    
# 9 lightgbm classification
