Skip to content

Consider new row mutation functions #4654

@hadley

Description

@hadley
Member

I've been wondering if dplyr needs a new set of verbs specifically designed for modifying rows identified by a set of keys:

  • insert new rows — would error if keys already exist; can specify where to add (how?). Similar to tibble::add_row().

  • update values — overrides existing values. (Similar to Dataset patching tools tidyr#183)

  • patch values — like update, but only replaces missing values (Also similar to Dataset patching tools tidyr#183)

  • upsert — update or insert depending on presence/absence of keys

Basic sketch

df <- tribble(
  ~ x, ~ y,
  1, 2,
  2, NA,
)

# Insert can add multiple rows
df %>% insert(x = 3:4, y = 5)
#>       x     y
#>   <dbl> <dbl>
#> 1     1     2
#> 2     2    NA
#> 3     3     5
#> 4     4     5

# And new columns:
df %>% insert(x = 3, z = 1)
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2    NA
#> 2     2    NA    NA
#> 3     3    NA     1

#> Error: key variable `x` is missing
df %>% insert(x = 1, y = 5)
#> Error: key `x = 1` already present

# Update modifies existing rows:
df %>% update(x = 1, y = 3)
#>       x     y
#>   <dbl> <dbl>
#> 1     1     3
#> 2     2    NA

# Erroring if there's no match:
df %>% update(x = 3, y = 3)
#> Error: key `x = 3` doesn't exist

# patch() works like update but only changes missing values
df %>% patch(x = 1:2, y = 10)
#>       x     y
#>   <dbl> <dbl>
#> 1     1     3
#> 2     2    10

# upsert() updates or inserts as needed:
df %>% upsert(x = c(1, 3), y = 3)
#>       x     y
#>   <dbl> <dbl>
#> 1     1     3
#> 2     2    NA
#> 3     3     3

Thoughts

  • All functions would either take multiple named inputs or a single unnamed data frame.

  • I think unlike SQL and ==, it makes sense for NA to match NA in keys. That's consistent with match().

  • If .key were ordered (can we determine that quickly?) insert() could insert new rows in right place. Otherwise would be have .before and .after that would be a single row data frame of key values?

  • .key would work like .by; defaulting to common variables with a message. But unlike joins, would check key set for uniqueness, erroring if not (i.e. no Cartesian products). That means insert(x, y) would always yield nrow(x) + nrow(y) and update(x, y) would always yield nrow(x).

  • Should all these functions get a rows prefix? Or suffix? rows_insert(), rows_update(), rows_patch(), ...? Does that suggest there's an equivalent set of col functions? cols_insert() (a single col join, like a vlookup? error if col already exists?), cols_move() (A dedicated verb to move columns within a table #4598), cols_patch() (df form of coalesce()?)

Backends

  • Grouping would be ignored (but recomputed if needed)
  • Would error for dbplyr as modifying data is out of scope.
  • Could modifying in place for dtplyr when immutable = FALSE
  • dm objects would use predefined primary keys and error if .key supplied
  • An imaginary mutable googlesheet could perform these operations on the underlying sheet

Activity

juangomezduaso

juangomezduaso commented on Dec 18, 2019

@juangomezduaso
added a commit that references this issue on Apr 20, 2020
billdenney

billdenney commented on Apr 21, 2020

@billdenney
Contributor

In the "backends" section, you note "Grouping would be ignored...". How is the key determined if not by grouping? Is it always that the key = the first argument? Would there be a mechanism for multi-column keys (e.g. country/state(province)/city could want grouping at the country level or at the country/state level)?

krlmlr

krlmlr commented on Apr 23, 2020

@krlmlr
Member

Groupings are different from keys. There will be support for multi-column (compound/composite) keys.

PR coming soon.

added a commit that references this issue on Apr 30, 2020
7dc9cc6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      Participants

      @hadley@krlmlr@juangomezduaso@billdenney

      Issue actions

        Consider new row mutation functions · Issue #4654 · tidyverse/dplyr