Problem with using tidy() (S3 registration) on Windows in parallel #64

Closed
@icejean

Hi all,
I'm reading this book to get to know tidymodels, and I've run into a problem with parallel resampling on Windows.
In the Chapter 10 examples, the main parallel part of fit_resamples() works with library(doParallel) on Windows; the only issue is with extract_fit_parsnip(x) in the extract function:

library(tidymodels)
# All operating systems
library(doParallel)
library(kableExtra)
library(tidyr)
tidymodels_prefer()

data(ames)
ames <- mutate(ames, Sale_Price = log10(Sale_Price))

set.seed(502)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)

ames_rec <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
           Latitude + Longitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>% 
  step_ns(Latitude, Longitude, deg_free = 20)

lm_model <- linear_reg() %>% set_engine("lm")

lm_wflow <- 
  workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(ames_rec)

lm_fit <- fit(lm_wflow, ames_train)

# This line is O.K.
extract_fit_parsnip(lm_fit) %>% tidy()

# ------------------------------------------------------------------------------------
set.seed(1001)
ames_folds <- vfold_cv(ames_train, v = 10)

# Create a cluster object and then register: 
# cl <- makePSOCKcluster(parallel::detectCores())
cl <- makePSOCKcluster(10)
registerDoParallel(cl)

get_model <- function(x) {
  # not O.K. on Windows & Linux.
  extract_fit_parsnip(x) %>% tidy()
  # This line is O.K.
  # extract_recipe(x, estimated = TRUE)
}

ctrl <- control_resamples(save_pred=TRUE,verbose=TRUE, extract = get_model)
set.seed(1003)
lm_res <- lm_wflow %>% fit_resamples(resamples = ames_folds, control = ctrl)

# Stop parallel
stopCluster(cl)

#  These lines are O.K.
lm_res
lm_res$.metrics[[1]]
lm_res$.notes[[1]]
lm_res$.predictions[[1]]
> lm_res$.extracts[[1]]
# A tibble: 1 x 2
  .extracts      .config             
  <list>         <chr>               
1 <try-errr [1]> Preprocessor1_Model1

> # To get the results
> lm_res$.extracts[[1]][[1]]
[[1]]
[1] "Error in UseMethod(\"tidy\") : \n  no applicable method for 'tidy' applied to an object of class \"lm\"\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in UseMethod("tidy"): no applicable method for 'tidy' applied to an object of class "lm">

Any idea?

Best regards.

icejean (Author) commented on Dec 24, 2022

Same results on both R-4.2.2 and R-4.1.2.

icejean (Author) commented on Dec 25, 2022

Here's a workaround. The problem is that extract_fit_parsnip(x) returns a list rather than a parsnip model when using library(doParallel) instead of library(doMC):

library(tidymodels)
# All operating systems
library(doParallel)
library(kableExtra)
library(tidyr)
tidymodels_prefer()

data(ames)
ames <- mutate(ames, Sale_Price = log10(Sale_Price))

set.seed(502)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)

ames_rec <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
           Latitude + Longitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>% 
  step_ns(Latitude, Longitude, deg_free = 20)

lm_model <- linear_reg() %>% set_engine("lm")

lm_wflow <- 
  workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(ames_rec)

lm_fit <- fit(lm_wflow, ames_train)

# This line is O.K.
lm_fited <- extract_fit_parsnip(lm_fit)
tidy(lm_fited)

# ------------------------------------------------------------------------------------
set.seed(1001)
ames_folds <- vfold_cv(ames_train, v = 10)

# Create a cluster object and then register: 
# cl <- makePSOCKcluster(parallel::detectCores())
cl <- makePSOCKcluster(10)
registerDoParallel(cl)

get_model <- function(x) {
  # extract_fit_parsnip(x) %>% tidy()
  extract_fit_parsnip(x)
}

ctrl <- control_resamples(save_pred=TRUE,verbose=TRUE, extract = get_model)
set.seed(1003)
lm_res <- lm_wflow %>% fit_resamples(resamples = ames_folds, control = ctrl)

# Stop parallel
stopCluster(cl)

lm_res$.extracts[[1]]
# To get the results
lm_res$.extracts[[1]][[1]]

# -----------------------------------------------------------------------------------
lm_fited
lm_fited %>% tidy()

# To get the results
test<-lm_res$.extracts[[1]][[1]]
test
tidy(test[[1]])

> tidy(test[[1]])
# A tibble: 73 × 5
   term                            estimate std.error statistic   p.value
   <chr>                              <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)                     -0.140    0.308       -0.455 6.49e-  1
 2 Gr_Liv_Area                      0.620    0.0167      37.1   8.50e-231
 3 Year_Built                       0.00177  0.000143    12.3   9.98e- 34
 4 Neighborhood_College_Creek      -0.0125   0.0358      -0.348 7.28e-  1
 5 Neighborhood_Old_Town           -0.0643   0.0133      -4.84  1.38e-  6
 6 Neighborhood_Edwards            -0.102    0.0297      -3.43  6.20e-  4
 7 Neighborhood_Somerset            0.0709   0.0209       3.39  7.03e-  4
 8 Neighborhood_Northridge_Heights  0.145    0.0385       3.75  1.81e-  4
 9 Neighborhood_Gilbert             0.0146   0.0353       0.414 6.79e-  1
10 Neighborhood_Sawyer             -0.114    0.0280      -4.09  4.56e-  5
# … with 63 more rows
# ℹ Use `print(n = ...)` to see more rows

lm_fited is a list of class _lm & model_fit:
[screenshot: parallel-1]
test is a list of length 1 containing a list of class _lm & model_fit, so we should use test[[1]] to refer to the model:
[screenshot: parallel-2]

This should be fixed in an upcoming version, then.

juliasilge (Member) commented on Jan 9, 2023

Thanks for the report @icejean! 🙌

We have some tests of parallel PSOCK resampling that we run every day, but I'm noticing that we don't have any test that uses tidy() in the worker; the method registration isn't working correctly there. I'm going to move this over to our testing repo so we can get to the bottom of this problem with S3 registration, then add a test for it.
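
One way to see what a fresh PSOCK worker has available is to ask each worker which namespaces it has loaded. The snippet below is only a diagnostic sketch using base R's parallel package; the two-worker cluster and the variable names are illustrative assumptions, not part of the thread's code.

```r
library(parallel)

# Start two fresh PSOCK workers (separate R processes).
cl <- makePSOCKcluster(2)

# Ask every worker which namespaces it currently has loaded.
worker_ns <- clusterEvalQ(cl, loadedNamespaces())

# One logical per worker: is broom (and with it the tidy() method for lm) loaded there?
broom_on_workers <- vapply(worker_ns, function(ns) "broom" %in% ns, logical(1))
print(broom_on_workers)

stopCluster(cl)
```

If broom is missing on the workers, tidy() has no lm method to dispatch to there, which would match the "no applicable method" error reported above.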

transferred this issue from tidymodels/TMwR on Jan 9, 2023
changed the title from "Parallel issue on Windows with fit_resamples() while extracting the model" to "Problem with using `tidy()` (S3 registration) on Windows in parallel" on Jan 9, 2023

icejean (Author) commented on Jan 10, 2023

Great!

topepo (Member) commented on Aug 1, 2023

We wouldn't expect any package to be available in PSOCK clusters (unlike multicore). To make sure they are available, you can load them in the extract function:

get_model <- function(x) {
  library(broom)  # <- add this as needed
  extract_fit_parsnip(x) %>% tidy()
}
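
A possible variation on the same idea is to attach broom on every worker once, right after the cluster is created, instead of inside the extract function. The sketch below uses parallel::clusterEvalQ() for that. The mtcars data, the small formula, and the two-worker cluster are illustrative stand-ins for the ames example, and it assumes the installed tune version still dispatches to the registered foreach backend.

```r
library(tidymodels)
library(doParallel)

cl <- makePSOCKcluster(2)
registerDoParallel(cl)

# Attach broom once on every worker so tidy() can dispatch on lm objects there.
loaded <- parallel::clusterEvalQ(cl, {
  library(broom)
  "broom" %in% loadedNamespaces()
})

get_model <- function(x) {
  extract_fit_parsnip(x) %>% tidy()
}

ctrl <- control_resamples(extract = get_model)

# Toy data and model for illustration only (not the ames example from the thread).
set.seed(1)
folds <- vfold_cv(mtcars, v = 3)
wflow <- workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_formula(mpg ~ wt + hp)

res <- fit_resamples(wflow, resamples = folds, control = ctrl)
stopCluster(cl)

# First fold's extract; a tibble of coefficients if the method was found.
res$.extracts[[1]]$.extracts[[1]]
```

control_resamples() also has a pkgs argument, documented as a way to name packages to load during parallel processing; whether it is enough for this case has not been checked here.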

icejean (Author) commented on Aug 5, 2023

Thanks Max, it works. I had read your book 'Tidy Modeling with R', which is where this issue came from. It's a good book for learning the tidyverse series of packages.


          Problem with using `tidy()` (S3 registration) on Windows in parallel · Issue #64 · tidymodels/extratests