Skip to content

outer joins don't keep join columns from both sides #4589

Closed
@skinner

Description

@skinner

The docs say

full_join() return all rows and all columns from both x and y

but it doesn't return all columns; I expect the result of this join to have four columns, but it has three:

> ta <- tibble(a=c(NA, 2, 3, 3), b=c(1, 2, 3, 4))
> tx <- tibble(x=c(3, 4, 5, NA), y=c(3, 4, 5, 6))
> dplyr::full_join(ta, tx, by=c(a="x"), na_matches = "never")
# A tibble: 7 x 3
      a     b     y
  <dbl> <dbl> <dbl>
1    NA     1    NA
2     2     2    NA
3     3     3     3
4     3     4     3
5     4    NA     4
6     5    NA     5
7    NA    NA     6

I'm interested in questions like "how many unique a have a matching x?" (and vice versa). To answer that, I'd need both a and x to exist in the output.

Presumably, dplyr collapses join columns down to one because in some joins (e.g. inner join) they'd be the same. But with outer joins they're different.

Left, right, and full joins all appear to have this behavior. This is with dplyr_0.8.1

Activity

romainfrancois

romainfrancois commented on Nov 18, 2019

@romainfrancois
Member

Maybe nest_join() can help you ?

library(dplyr, warn.conflicts = FALSE)

ta <- tibble(a=c(NA, 2, 3, 3), b=c(1, 2, 3, 4))
tx <- tibble(x=c(3, 4, 5, NA), y=c(3, 4, 5, 6))

nest_join(ta, tx, by=c(a="x"))
#> # A tibble: 4 x 3
#>       a     b tx              
#>   <dbl> <dbl> <list>          
#> 1    NA     1 <tibble [1 × 1]>
#> 2     2     2 <tibble [0 × 1]>
#> 3     3     3 <tibble [1 × 1]>
#> 4     3     4 <tibble [1 × 1]>
nest_join(tx, ta, by=c(x="a"))
#> # A tibble: 4 x 3
#>       x     y ta              
#>   <dbl> <dbl> <list>          
#> 1     3     3 <tibble [2 × 1]>
#> 2     4     4 <tibble [0 × 1]>
#> 3     5     5 <tibble [0 × 1]>
#> 4    NA     6 <tibble [1 × 1]>

Created on 2019-11-18 by the reprex package (v0.3.0.9000)

skinner

skinner commented on Nov 18, 2019

@skinner
Author

Hm, thanks for the pointer (to nest_join and to reprex!), and for looking at the issue. I ended up making copies of both join columns and then renaming them back after the join:

library(dplyr, warn.conflicts=FALSE)

ta <- tibble(a=c(NA, 2, 3, 3), b=c(1, 2, 3, 4))
tx <- tibble(x=c(3, 4, 5, NA), y=c(3, 4, 5, 6))

dplyr::full_join(
  ta %>% mutate(a_copy = a),
  tx %>% mutate(x_copy = x),
  by = c(a="x"),
  na_matches = "never"
) %>%
  dplyr::select(-a) %>%
  dplyr::rename(a = a_copy, x = x_copy)
#> # A tibble: 7 x 4
#>       b     a     y     x
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1    NA    NA    NA
#> 2     2     2    NA    NA
#> 3     3     3     3     3
#> 4     4     3     3     3
#> 5    NA    NA     4     4
#> 6    NA    NA     5     5
#> 7    NA    NA     6    NA

Created on 2019-11-18 by the reprex package (v0.3.0)

I still think that full_join should keep the original columns, because there's information there that otherwise gets lost in a full join.

Maybe there are backward-compatibility concerns at this point, though. If you do choose to keep the current behavior, I think it makes sense to document that the original columns aren't all kept.

added
featurea feature request or enhancement
tables 🧮joins and set operations
on Dec 10, 2019
hadley

hadley commented on Jan 11, 2020

@hadley
Member

I think full_join() will gain a keep argument that works similarly to nest_join(). It'll default to FALSE for backward compatibility, but you'll be able to set it to TRUE if needed.

skinner

skinner commented on Jan 11, 2020

@skinner
Author

Adding an argument (with the previous behavior as a default) makes sense for back-compat. I'm not sure exactly what "works similarly to nest_join()" means; I'd expect full_join to return a result that's the same as the one from the column-copying version in my last comment (maybe with a different ordering of columns). That's what a database would give you; I guess that's the perspective I'm coming from. Here's the postgres result, for example:

mitch=> select * from ta;
 a | b 
---+---
   | 1
 2 | 2
 3 | 3
 3 | 4
(4 rows)

mitch=> select * from tx;
 x | y 
---+---
 3 | 3
 4 | 4
 5 | 5
   | 6
(4 rows)

mitch=> select * from ta full outer join tx on ta.a = tx.x;
 a | b | x | y 
---+---+---+---
 2 | 2 |   |  
 3 | 3 | 3 | 3
 3 | 4 | 3 | 3
   | 1 |   |  
   |   | 4 | 4
   |   | 5 | 5
   |   |   | 6
(7 rows)

I'm not sure if you have a general story for how closely to match database semantics. I personally think about joins in a very database-y way; I typically avoid nested tables. So my vote (FWIW) is for the result above to at least be an option.

hadley

hadley commented on Jan 12, 2020

@hadley
Member

@skinner yes that's exactly what I was meaning.

hadley

hadley commented on Jan 12, 2020

@hadley
Member

@skinner does this look as you'd expect?

library(dplyr, warn.conflicts = FALSE)

ta <- tibble(a=c(NA, 2, 3, 3), b=c(1, 2, 3, 4))
tx <- tibble(x=c(3, 4, 5, NA), y=c(3, 4, 5, 6))
full_join(ta, tx, by = c("a" = "x"), keep = TRUE)
#> # A tibble: 6 x 4
#>       a     b     x     y
#>   <dbl> <dbl> <dbl> <dbl>
#> 1    NA     1    NA     6
#> 2     2     2    NA    NA
#> 3     3     3     3     3
#> 4     3     4     3     3
#> 5     4    NA     4     4
#> 6     5    NA     5     5

Created on 2020-01-12 by the reprex package (v0.3.0)

(I think the difference from the SQL results is NA matches NA in R, but NULL doesn't match NULL in SQL)

skinner

skinner commented on Jan 12, 2020

@skinner
Author

Hm, in that output, the a column has a 4 and a 5, which it didn't have in the input. That's not what I expected.

Given that there's an na_matches argument, I assume that passing both na_matches = "never" and keep = TRUE would give a result that matches the SQL result?

hadley

hadley commented on Jan 12, 2020

@hadley
Member

@skinner good catch, I'll have to think about that more. na_matches is currently unavailable in the dev version due to some changes in the underlying implementation (we're planning on bringing it back but it'll probably be a few weeks yet)

hadley

hadley commented on Jan 13, 2020

@hadley
Member

Ok, finally got it 😄

library(dplyr, warn.conflicts = FALSE)

df1 <- tibble(a = c(2, 3), b = c(1, 2))
df2 <- tibble(x = c(3, 4), y = c(3, 4))
full_join(df1, df2, by = c("a" = "x"))
#> # A tibble: 3 x 3
#>       a     b     y
#>   <dbl> <dbl> <dbl>
#> 1     2     1    NA
#> 2     3     2     3
#> 3     4    NA     4

full_join(df1, df2, by = c("a" = "x"), keep = TRUE)
#> # A tibble: 3 x 4
#>       a     b     x     y
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     2     1    NA    NA
#> 2     3     2     3     3
#> 3    NA    NA     4     4

Created on 2020-01-13 by the reprex package (v0.3.0)

added a commit that references this issue on Jan 13, 2020
ba3ff77
skinner

skinner commented on Jan 14, 2020

@skinner
Author

Awesome, thank you! 🎉

ianmcook

ianmcook commented on Jan 16, 2020

@ianmcook
Contributor

@hadley do you think you'll implement the keep argument for left_join() and right_join() as well? In databases, left and right outer joins return the join key columns from both the left and right tables. (It's not just full outer joins that do this.) Thanks!

ianmcook

ianmcook commented on Jan 16, 2020

@ianmcook
Contributor

@hadley dbplyr includes code to make databases return results that are consistent with keep = FALSE. See https://github.com/tidyverse/dbplyr/blob/master/R/query-join.R#L90-L106; I originally contributed this in tidyverse/dbplyr#3. Time permitting, I'll open a PR to make dbplyr act differently when keep = TRUE.

hadley

hadley commented on Jan 16, 2020

@hadley
Member

@ianmcook Oh yeah, that makes sense. Would you be interested in doing a PR?

ianmcook

ianmcook commented on Jan 16, 2020

@ianmcook
Contributor

@hadley sure, I'll give it a try, following the way you implemented it in full_join() in 1e76e8a

ianmcook

ianmcook commented on Jan 16, 2020

@ianmcook
Contributor

@hadley currently in full_join() when keep = TRUE, only the join column(s) from the RHS table get the suffix (.y) appended. I think the join column(s) from the LHS table should also get the suffix (.x) appended—because they're not coalesced, so they're the values from the LHS table. Ok if I make this change too?

ianmcook

ianmcook commented on Jan 16, 2020

@ianmcook
Contributor

@hadley please take a look at #4762. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    featurea feature request or enhancementtables 🧮joins and set operations

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @hadley@skinner@ianmcook@romainfrancois

        Issue actions

          outer joins don't keep join columns from both sides · Issue #4589 · tidyverse/dplyr