Closed
Dear `dplyr` developers,
A recent Stack Overflow question raised an interesting use case: having the columns fed to a `mutate()` call automatically removed from the result. To do this, the mutator would need to parse the input expressions to determine which symbols were used, and I made a first pass at designing such a function. The question author liked my answer and suggested that I contribute it to `dplyr`.
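To make the idea concrete, here is a minimal sketch of the expression-parsing approach. The helper name `mutate_drop_used()` is purely illustrative and is not part of dplyr; the sketch assumes ungrouped data and treats every symbol that matches an existing column name as "used".

```r
library(dplyr)

# Hypothetical helper (not a dplyr function): mutate, then drop the
# input columns that the supplied expressions referred to.
mutate_drop_used <- function(.data, ...) {
  exprs <- enquos(...)
  # all.vars() lists the symbol names that appear in an expression
  used <- unique(unlist(lapply(exprs, function(q) all.vars(rlang::quo_get_expr(q)))))
  # Keep only symbols that are real columns, and never drop a column
  # that the call itself (re)creates
  to_drop <- setdiff(intersect(used, names(.data)), names(exprs))
  out <- mutate(.data, !!!exprs)
  out[setdiff(names(out), to_drop)]
}

df <- tibble(a = 1:3, b = 4:6, c = 7:9)
mutate_drop_used(df, d = a + b)  # leaves columns c and d
```

Matching symbols against column names is a simplification: a variable from the calling environment that happens to share a column's name would be treated the same way the column is.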
I am happy to work on a PR with a more robust implementation, but I wanted to check with you if such a feature would align with your design principles and the spirit of the package.
Thanks. Big fan of your work.
-Artem
krlmlr commented on Jul 21, 2018
Thanks, I'm missing this functionality myself occasionally. Instead of parsing the expression, we could detect and record column access in the C++ code.
What should the verb do in a grouped scenario, if some groups access a different set of columns than other groups?
ArtemSokolov commented on Jul 22, 2018
I think there are two natural options:
- `union` - remove all columns that are accessed by at least one group, or
- `intersection` - remove only the columns that are accessed by all groups.
It might be nice to be able to specify which of the two should be used, but I'm not sure where such an option would go...
krlmlr commented on Aug 1, 2018
I'd rather support only `intersection`, but that might be more difficult to implement than `union`.
The biggest problem I see with both options is that type stability is compromised -- the resulting data frame might end up with different columns, depending on the data. Perhaps the safest thing to do would be to raise an error if different columns are accessed for different groups from this verb.
What naming alternatives do we have? It might be difficult to remember the differences between `mutate()`, `transmute()` and `transmutate()`.
ArtemSokolov commented on Aug 1, 2018
I agree that perhaps the appropriate action for accessing different columns across groups is to raise an error. I'm not sure I follow the type stability concerns; if the `intersection` of all accessed columns is removed from each group, is that not a consistent transformation of each group?
For naming alternatives, perhaps we can turn to a thesaurus: https://www.thesaurus.com/browse/transmute
If it doesn't become too annoying to type, `metamorphose()` might be a viable option.
EDIT: A nicer alternative might be `alter()`, as taken from https://www.thesaurus.com/browse/mutate
krlmlr commented on Aug 2, 2018
Suppose we have two group types, X and Y. The mutator code for groups of type X accesses column `a`, and for groups of type Y it accesses column `b`. If only groups of type X are present in the data, column `a` is accessed and removed; if both types are present, both columns `a` and `b` are accessed -- which to remove? Both intersect and union produce results inconsistent with the first scenario.
We create something, but also take something else away. How about `trade()`?
ArtemSokolov commented on Aug 2, 2018
Thanks for the example, Kirill. That makes sense, and raising an error seems like the best approach to maintain consistency.
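The three policies discussed so far can be sketched in a few lines of base R. Everything here is hypothetical: `resolve_used()` and the `accessed` list (one character vector of accessed column names per group) are made up for illustration and say nothing about how dplyr would record column access internally.

```r
# Hypothetical: decide which columns to remove, given the columns each
# group accessed. `accessed` must contain at least one group.
resolve_used <- function(accessed, policy = c("error", "union", "intersection")) {
  policy <- match.arg(policy)
  switch(policy,
    union = Reduce(union, accessed),
    intersection = Reduce(intersect, accessed),
    error = {
      # Compare every group's access set against the first group's
      first <- sort(unique(accessed[[1]]))
      same <- vapply(accessed, function(x) identical(sort(unique(x)), first), logical(1))
      if (!all(same)) stop("Different groups access different columns")
      first
    }
  )
}

# Only groups of type X are present: column "a" is removed
resolve_used(list(X = "a"), "error")

# Both types are present: neither answer matches the X-only case
accessed <- list(X = "a", Y = "b")
resolve_used(accessed, "union")         # "a" "b"
resolve_used(accessed, "intersection")  # character(0)

# The "error" policy refuses to guess
tryCatch(resolve_used(accessed, "error"), error = function(e) conditionMessage(e))
```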
I think I have a slight preference towards `alter()`, because it is semantically similar to `mutate()` and `transmute()`. But `trade()` is a good choice as well!
krlmlr commented on Aug 3, 2018
`morph()`?
ArtemSokolov commented on Aug 3, 2018
Yes!! `morph()` is perfect.
vlepori commented on Aug 8, 2018
dekaufman commented on Aug 16, 2018
moodymudskipper commented on Aug 27, 2018
mkoohafkan commented on Sep 8, 2018
krlmlr commented on Sep 8, 2018
I like the idea of the `.keep = "all"` argument to `mutate()`, or perhaps `.remove = "none"` (with options `"other"`, `"used"` and `"used_once"`). But it's rather low priority now.