Problem with using tidy() (S3 registration) on Windows in parallel #64

Closed
icejean opened this issue Dec 24, 2022 · 6 comments
icejean commented Dec 24, 2022

Hi all,
I'm reading this book to get to know tidymodels, and I've run into a problem with parallel resampling on Windows.
For the examples in Chapter 10, the main parallel part of fit_resamples() works with library(doParallel) on Windows. The only issue is with extract_fit_parsnip(x):

library(tidymodels)
# All operating systems
library(doParallel)
library(kableExtra)
library(tidyr)
tidymodels_prefer()

data(ames)
ames <- mutate(ames, Sale_Price = log10(Sale_Price))

set.seed(502)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)

ames_rec <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
           Latitude + Longitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>% 
  step_ns(Latitude, Longitude, deg_free = 20)

lm_model <- linear_reg() %>% set_engine("lm")

lm_wflow <- 
  workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(ames_rec)

lm_fit <- fit(lm_wflow, ames_train)

# This line is O.K.
extract_fit_parsnip(lm_fit) %>% tidy()

# ------------------------------------------------------------------------------------
set.seed(1001)
ames_folds <- vfold_cv(ames_train, v = 10)

# Create a cluster object and then register: 
# cl <- makePSOCKcluster(parallel::detectCores())
cl <- makePSOCKcluster(10)
registerDoParallel(cl)

get_model <- function(x) {
  # not O.K. on Windows & Linux.
  extract_fit_parsnip(x) %>% tidy()
  # This line is O.K.
  # extract_recipe(x, estimated = TRUE)
}

ctrl <- control_resamples(save_pred=TRUE,verbose=TRUE, extract = get_model)
set.seed(1003)
lm_res <- lm_wflow %>% fit_resamples(resamples = ames_folds, control = ctrl)

# Stop parallel
stopCluster(cl)

#  These lines are O.K.
lm_res
lm_res$.metrics[[1]]
lm_res$.notes[[1]]
lm_res$.predictions[[1]]
> lm_res$.extracts[[1]]
# A tibble: 1 x 2
  .extracts      .config             
  <list>         <chr>               
1 <try-errr [1]> Preprocessor1_Model1

> # To get the results
> lm_res$.extracts[[1]][[1]]
[[1]]
[1] "Error in UseMethod(\"tidy\") : \n  no applicable method for 'tidy' applied to an object of class \"lm\"\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in UseMethod("tidy"): no applicable method for 'tidy' applied to an object of class "lm">

Any idea?

Best regards.

@icejean
Author

icejean commented Dec 24, 2022

Same results on both R-4.2.2 and R-4.1.2.

@icejean
Author

icejean commented Dec 25, 2022

Here's a workaround. The problem seems to be that extract_fit_parsnip(x) returns a list rather than a parsnip model when using library(doParallel) instead of library(doMC):

library(tidymodels)
# All operating systems
library(doParallel)
library(kableExtra)
library(tidyr)
tidymodels_prefer()

data(ames)
ames <- mutate(ames, Sale_Price = log10(Sale_Price))

set.seed(502)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)

ames_rec <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
           Latitude + Longitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>% 
  step_ns(Latitude, Longitude, deg_free = 20)

lm_model <- linear_reg() %>% set_engine("lm")

lm_wflow <- 
  workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(ames_rec)

lm_fit <- fit(lm_wflow, ames_train)

# This line is O.K.
lm_fited <- extract_fit_parsnip(lm_fit)
tidy(lm_fited)

# ------------------------------------------------------------------------------------
set.seed(1001)
ames_folds <- vfold_cv(ames_train, v = 10)

# Create a cluster object and then register: 
# cl <- makePSOCKcluster(parallel::detectCores())
cl <- makePSOCKcluster(10)
registerDoParallel(cl)

get_model <- function(x) {
  # extract_fit_parsnip(x) %>% tidy()
  extract_fit_parsnip(x)
}

ctrl <- control_resamples(save_pred=TRUE,verbose=TRUE, extract = get_model)
set.seed(1003)
lm_res <- lm_wflow %>% fit_resamples(resamples = ames_folds, control = ctrl)

# Stop parallel
stopCluster(cl)

lm_res$.extracts[[1]]
# To get the results
lm_res$.extracts[[1]][[1]]

# -----------------------------------------------------------------------------------
lm_fited
lm_fited %>% tidy()

# To get the results
test<-lm_res$.extracts[[1]][[1]]
test
tidy(test[[1]])

> tidy(test[[1]])
# A tibble: 73 × 5
   term                            estimate std.error statistic   p.value
   <chr>                              <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)                     -0.140    0.308       -0.455 6.49e-  1
 2 Gr_Liv_Area                      0.620    0.0167      37.1   8.50e-231
 3 Year_Built                       0.00177  0.000143    12.3   9.98e- 34
 4 Neighborhood_College_Creek      -0.0125   0.0358      -0.348 7.28e-  1
 5 Neighborhood_Old_Town           -0.0643   0.0133      -4.84  1.38e-  6
 6 Neighborhood_Edwards            -0.102    0.0297      -3.43  6.20e-  4
 7 Neighborhood_Somerset            0.0709   0.0209       3.39  7.03e-  4
 8 Neighborhood_Northridge_Heights  0.145    0.0385       3.75  1.81e-  4
 9 Neighborhood_Gilbert             0.0146   0.0353       0.414 6.79e-  1
10 Neighborhood_Sawyer             -0.114    0.0280      -4.09  4.56e-  5
# … with 63 more rows
# ℹ Use `print(n = ...)` to see more rows

lm_fited is a list of class _lm & model_fit:
[screenshot: parallel-1]
test is a list of length 1 containing a list of class _lm & model_fit, so we should use test[[1]] to refer to the model:
[screenshot: parallel-2]

This issue should be fixed in the coming version then.

@juliasilge
Member

Thanks for the report @icejean! 🙌

We have some tests of parallel PSOCK resampling that we run every day, but I am noticing that we don't have any test of using tidy() in the worker; the method registration isn't working correctly in the worker. I'm going to move this over to our testing repo so we can get to the bottom of this problem with S3 registration, then add a test for it.
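
To illustrate the point about workers, here is a minimal base-R sketch (not the tidymodels test itself) showing that a PSOCK worker starts as a fresh R session and does not inherit packages attached in the main session. It uses splines only because it ships with R; that choice and the variable names are assumptions for illustration.

```r
library(parallel)

# Attach a package in the main session only (splines ships with R).
library(splines)

cl <- makePSOCKcluster(2)

# Each PSOCK worker is a separate R process with its own search path,
# so the package attached above is not attached there.
attached_before <- clusterEvalQ(cl, "package:splines" %in% search())

# Attach the package explicitly on every worker, then check again.
invisible(clusterEvalQ(cl, library(splines)))
attached_after <- clusterEvalQ(cl, "package:splines" %in% search())

stopCluster(cl)

print(unlist(attached_before))
print(unlist(attached_after))
```

The same reasoning would apply to broom: if its namespace is never loaded on a worker, its tidy() methods are not registered there.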

@juliasilge juliasilge transferred this issue from tidymodels/TMwR Jan 9, 2023
@juliasilge juliasilge changed the title from "Parallel issue on Windows with fit_resamples() while extracting the model" to "Problem with using tidy() (S3 registration) on Windows in parallel" Jan 9, 2023
@icejean
Author

icejean commented Jan 10, 2023

Great!

@topepo
Member

topepo commented Aug 1, 2023

We wouldn't expect any package to be available in psock clusters (unlike multicore). To make sure that you have them available, you can load them in the extract:

get_model <- function(x) {
  library(broom) # <- add this as needed
  extract_fit_parsnip(x) %>% tidy()
}
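
A possible variation on the fix above, as a sketch only: namespace-qualify the calls instead of calling library() inside the extract function. Calling broom::tidy() loads the broom namespace on the worker, which should register its tidy() methods (including the one for lm). This assumes broom and workflows are installed wherever the workers run, and it has not been verified on Windows here.

```r
# Hypothetical alternative to calling library(broom) inside the extract
# function. Assumes the broom and workflows packages are installed.
get_model <- function(x) {
  # workflows re-exports extract_fit_parsnip(); referring to broom::tidy
  # loads the broom namespace, so its S3 methods get registered on the
  # worker before dispatch happens.
  fit <- workflows::extract_fit_parsnip(x)
  broom::tidy(fit)
}
```

It would be passed the same way as before, via control_resamples(extract = get_model). Another option is to run parallel::clusterEvalQ(cl, library(broom)) once after creating the cluster, so that every worker has broom attached.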

@topepo topepo closed this as completed Aug 1, 2023
@icejean
Author

icejean commented Aug 5, 2023

Thanks Max, it works. I've been reading your book 'Tidy Modeling with R', which is where this issue came from. It's a good book for learning the tidyverse family of packages.
