morph() to automatically remove columns "used up" by a mutate() #3721

Closed
@ArtemSokolov

Description

Dear dplyr developers,

A recent Stack Overflow question raised an interesting use case: having the columns fed to a mutate() call automatically removed from the result. To do this, the mutator would need to parse the input expressions to determine which symbols were used, and I made a first pass at designing such a function. The question author liked my answer and suggested that I contribute it to dplyr.
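
The parsing idea can be illustrated with a minimal base-R sketch. This is not the Stack Overflow answer and not dplyr code: `morph_sketch()` and its `name`/`expr` arguments are hypothetical, and it handles only a single new column on an ungrouped data frame.

```r
# Hypothetical helper (not part of dplyr): add one column, then drop the
# input columns that the expression referred to.
morph_sketch <- function(data, name, expr) {
  expr <- substitute(expr)                        # capture the unevaluated expression
  used <- intersect(all.vars(expr), names(data))  # symbols that are columns of `data`
  used <- setdiff(used, name)                     # never drop the column being created
  data[[name]] <- eval(expr, data, parent.frame())
  data[setdiff(names(data), used)]                # keep everything that was not "used up"
}

res <- morph_sketch(iris, "Petal.Area", Petal.Width * Petal.Length)
names(res)
```

Here `all.vars()` does the symbol detection; a robust version would also need to deal with symbols that shadow variables in the calling environment.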

I am happy to work on a PR with a more robust implementation, but I wanted to check with you if such a feature would align with your design principles and the spirit of the package.

Thanks. Big fan of your work.
-Artem

Activity

krlmlr (Member) commented on Jul 21, 2018

library(tidyverse)

#iris %>% transmutate(Petal.Area = Petal.Width * Petal.Length)
iris %>%
  as_tibble() %>% 
  mutate(Petal.Area = Petal.Width * Petal.Length) %>% 
  select(-Petal.Width, -Petal.Length)
#> # A tibble: 150 x 4
#>    Sepal.Length Sepal.Width Species Petal.Area
#>           <dbl>       <dbl> <fct>        <dbl>
#>  1          5.1         3.5 setosa       0.280
#>  2          4.9         3   setosa       0.280
#>  3          4.7         3.2 setosa       0.26 
#>  4          4.6         3.1 setosa       0.3  
#>  5          5           3.6 setosa       0.280
#>  6          5.4         3.9 setosa       0.68 
#>  7          4.6         3.4 setosa       0.42 
#>  8          5           3.4 setosa       0.3  
#>  9          4.4         2.9 setosa       0.280
#> 10          4.9         3.1 setosa       0.15 
#> # ... with 140 more rows

Created on 2018-07-21 by the reprex package (v0.2.0).

Thanks, I'm missing this functionality myself occasionally. Instead of parsing the expression, we could detect and record column access in the C++ code.

What should the verb do in a grouped scenario, if some groups access a different set of columns than other groups?

ArtemSokolov (Author) commented on Jul 22, 2018

I think there are two natural options: union (remove all columns that are accessed by at least one group) or intersection (remove only the columns that are accessed by all groups). It might be nice to be able to specify which of the two should be used, but I'm not sure where such an option would go...
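
The two options can be sketched in base R, assuming the columns accessed by each group have already been recorded somehow. The `accessed` list and its contents are made up for illustration.

```r
# Hypothetical record of which columns each group's expression touched.
accessed <- list(
  X = c("a", "b"),
  Y = c("b", "c")
)

drop_union     <- Reduce(union, accessed)      # touched by at least one group
drop_intersect <- Reduce(intersect, accessed)  # touched by every group

drop_union
drop_intersect
```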

krlmlr (Member) commented on Aug 1, 2018

I'd rather support only intersection, but that might be more difficult to implement than union.

The biggest problem I see with both options is that type stability is compromised -- the resulting data frame might end up with different columns, depending on the data. Perhaps the safest thing to do would be to raise an error if different columns are accessed for different groups from this verb.
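
The "raise an error" behaviour might look roughly like the following sketch. `check_consistent_access()` is a hypothetical helper, and its input (a named list with one character vector of accessed column names per group) is an assumption about how access would be recorded.

```r
# Hypothetical helper: fail unless every group accessed the same set of columns.
check_consistent_access <- function(accessed) {
  stopifnot(length(accessed) > 0)
  sets <- lapply(accessed, function(cols) sort(unique(cols)))
  same <- vapply(sets, identical, logical(1), sets[[1]])
  if (!all(same)) {
    stop("Groups accessed different columns: ",
         paste(names(accessed)[!same], collapse = ", "),
         " differ from ", names(accessed)[1], call. = FALSE)
  }
  sets[[1]]  # the shared set of columns, safe to remove
}

check_consistent_access(list(X = c("a", "b"), Y = c("b", "a")))  # same set, no error
try(check_consistent_access(list(X = "a", Y = "b")))             # signals an error
```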

What naming alternatives do we have? It might be difficult to remember the differences between mutate(), transmute() and transmutate().

ArtemSokolov (Author) commented on Aug 1, 2018

I agree that perhaps the appropriate action for accessing different columns across groups is to raise an error. I'm not sure I follow the type stability concerns; if the intersection of all accessed columns is removed from each group, is that not a consistent transformation of each group?

For naming alternatives, perhaps we can turn to a thesaurus: https://www.thesaurus.com/browse/transmute
If it doesn't become too annoying to type, metamorphose() might be a viable option.

EDIT: A nicer alternative might be alter(), as taken from https://www.thesaurus.com/browse/mutate

krlmlr (Member) commented on Aug 2, 2018

Suppose we have two group types, X and Y: the mutator code for group X accesses column a, and for group Y it accesses column b. If only groups of type X are present in the data, column a is accessed and removed; if both types are present, both columns a and b are accessed -- which to remove? Both intersect and union produce results inconsistent with the first scenario.
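
The data dependence in this example can be made concrete with a small base-R sketch. `access_by_type` and `columns_to_drop()` are hypothetical names, and the access pattern is simply the one assumed in the example (X reads a, Y reads b).

```r
# Assumed access pattern from the example: group type X reads `a`, Y reads `b`.
access_by_type <- list(X = "a", Y = "b")

# Hypothetical helper: columns removed for a given set of group types that are
# present in the data, under a given combining rule (union or intersect).
columns_to_drop <- function(groups_present, rule) {
  Reduce(rule, access_by_type[groups_present])
}

columns_to_drop("X", union)               # only X present
columns_to_drop(c("X", "Y"), union)       # both present, union
columns_to_drop(c("X", "Y"), intersect)   # both present, intersection
```

With only X present, column a is dropped under either rule. With both types present, union drops a and b while intersection drops nothing, so the output columns depend on which groups happen to occur in the data.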

We create something, but also take something else away. How about trade() ?

ArtemSokolov (Author) commented on Aug 2, 2018

Thanks for the example, Kirill. That makes sense, and raising an error seems like the best approach to maintain consistency.

I think I have a slight preference towards alter(), because it is semantically similar to mutate() and transmute(). But trade() is a good choice as well!

krlmlr (Member) commented on Aug 3, 2018

morph() ?

ArtemSokolov (Author) commented on Aug 3, 2018

Yes!! morph() is perfect.

vlepori commented on Aug 8, 2018

dekaufman commented on Aug 16, 2018

moodymudskipper commented on Aug 27, 2018

mkoohafkan commented on Sep 8, 2018

krlmlr (Member) commented on Sep 8, 2018

I like the idea of the .keep = "all" argument to mutate(), or perhaps .remove = "none" (with options "other", "used" and "used_once"). But it's rather low priority now.
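
For reference, later dplyr releases (1.0.0 and newer, as far as I know) do provide a `.keep` argument on mutate(), with `"all"` as the default and an `"unused"` option that drops the columns read by the expressions. The option names differ from those floated in this comment. A short sketch, assuming such a dplyr version is installed:

```r
library(dplyr)

# `.keep = "unused"` retains only columns that were not used to create the
# new ones, plus the new columns themselves.
res <- iris %>%
  as_tibble() %>%
  mutate(Petal.Area = Petal.Width * Petal.Length, .keep = "unused")

names(res)
```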

Participants: @hadley, @ArtemSokolov, @krlmlr, @romainfrancois, @lionel-