-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Description
I've been wondering if dplyr needs a new set of verbs specifically designed for modifying rows identified by a set of keys:
-
insert new rows — would error if keys already exist; can specify where to add (how?). Similar to
tibble::add_row()
. -
update values — overrides existing values. (Similar to Dataset patching tools tidyr#183)
-
patch values — like update, but only replaces missing values (Also similar to Dataset patching tools tidyr#183)
-
upsert — update or insert depending on presence/absence of keys
Basic sketch
df <- tribble(
~ x, ~ y,
1, 2,
2, NA,
)
# Insert can add multiple rows
df %>% insert(x = 3:4, y = 5)
#> x y
#> <dbl> <dbl>
#> 1 1 2
#> 2 2 NA
#> 3 3 5
#> 4 4 5
# And new columns:
df %>% insert(x = 3, z = 1)
#> x y z
#> <dbl> <dbl> <dbl>
#> 1 1 2 NA
#> 2 2 NA NA
#> 3 3 NA 1
#> Error: key variable `x` is missing
df %>% insert(x = 1, y = 5)
#> Error: key `x = 1` already present
# Update modifies existing rows:
df %>% update(x = 1, y = 3)
#> x y
#> <dbl> <dbl>
#> 1 1 3
#> 2 2 NA
# Erroring if there's no match:
df %>% update(x = 3, y = 3)
#> Error: key `x = 3` doesn't exist
# patch() works like update but only changes missing values
df %>% patch(x = 1:2, y = 10)
#> x y
#> <dbl> <dbl>
#> 1 1 3
#> 2 2 10
# upsert() updates or inserts as needed:
df %>% upsert(x = c(1, 3), y = 3)
#> x y
#> <dbl> <dbl>
#> 1 1 3
#> 2 2 NA
#> 3 3 3
Thoughts
-
All functions would either take multiple named inputs or a single unnamed data frame.
-
I think unlike SQL and
==
, it makes sense forNA
to matchNA
in keys. That's consistent withmatch()
. -
If
.key
were ordered (can we determine that quickly?)insert()
could insert new rows in right place. Otherwise would be have.before
and.after
that would be a single row data frame of key values? -
.key
would work like.by
; defaulting to common variables with a message. But unlike joins, would check key set for uniqueness, erroring if not (i.e. no Cartesian products). That meansinsert(x, y)
would always yieldnrow(x) + nrow(y)
andupdate(x, y)
would always yieldnrow(x)
. -
Should all these functions get a
rows
prefix? Or suffix?rows_insert()
,rows_update()
,rows_patch()
, ...? Does that suggest there's an equivalent set ofcol
functions?cols_insert()
(a single col join, like a vlookup? error if col already exists?),cols_move()
(A dedicated verb to move columns within a table #4598),cols_patch()
(df form ofcoalesce()
?)
Backends
- Grouping would be ignored (but recomputed if needed)
- Would error for dbplyr as modifying data is out of scope.
- Could modifying in place for dtplyr when
immutable = FALSE
- dm objects would use predefined primary keys and error if
.key
supplied - An imaginary mutable googlesheet could perform these operations on the underlying sheet
Activity
juangomezduaso commentedon Dec 18, 2019
Synthesis with tidyverse/dplyr#4654
billdenney commentedon Apr 21, 2020
In the "backends" section, you note "Grouping would be ignored...". How is the key determined if not by grouping? Is it always that the key = the first argument? Would there be a mechanism for multi-column keys (e.g. country/state(province)/city could want grouping at the country level or at the country/state level)?
krlmlr commentedon Apr 23, 2020
Groupings are different from keys. There will be support for multi-column (compound/composite) keys.
PR coming soon.
rows_() (#5142)