Closed
Description
The docs say
full_join() return all rows and all columns from both x and y
but it doesn't return all columns; I expect the result of this join to have four columns, but it has three:
> ta <- tibble(a=c(NA, 2, 3, 3), b=c(1, 2, 3, 4))
> tx <- tibble(x=c(3, 4, 5, NA), y=c(3, 4, 5, 6))
> dplyr::full_join(ta, tx, by=c(a="x"), na_matches = "never")
# A tibble: 7 x 3
a b y
<dbl> <dbl> <dbl>
1 NA 1 NA
2 2 2 NA
3 3 3 3
4 3 4 3
5 4 NA 4
6 5 NA 5
7 NA NA 6
I'm interested in questions like "how many unique a
have a matching x
?" (and vice versa). To answer that, I'd need both a
and x
to exist in the output.
Presumably, dplyr collapses join columns down to one because in some joins (e.g. inner join) they'd be the same. But with outer joins they're different.
Left, right, and full joins all appear to have this behavior. This is with dplyr_0.8.1
Metadata
Metadata
Assignees
Type
Projects
Milestone
Relationships
Development
No branches or pull requests
Activity
romainfrancois commentedon Nov 18, 2019
Maybe
nest_join()
can help you ?Created on 2019-11-18 by the reprex package (v0.3.0.9000)
skinner commentedon Nov 18, 2019
Hm, thanks for the pointer (to nest_join and to reprex!), and for looking at the issue. I ended up making copies of both join columns and then renaming them back after the join:
Created on 2019-11-18 by the reprex package (v0.3.0)
I still think that
full_join
should keep the original columns, because there's information there that otherwise gets lost in a full join.Maybe there are backward-compatibility concerns at this point, though. If you do choose to keep the current behavior, I think it makes sense to document that the original columns aren't all kept.
hadley commentedon Jan 11, 2020
I think
full_join()
will gain akeep
argument that works similarly tonest_join()
. It'll default toFALSE
for backward compatibility, but you'll be able to set it toTRUE
if needed.skinner commentedon Jan 11, 2020
Adding an argument (with the previous behavior as a default) makes sense for back-compat. I'm not sure exactly what "works similarly to
nest_join()
" means; I'd expect full_join to return a result that's the same as the one from the column-copying version in my last comment (maybe with a different ordering of columns). That's what a database would give you; I guess that's the perspective I'm coming from. Here's the postgres result, for example:I'm not sure if you have a general story for how closely to match database semantics. I personally think about joins in a very database-y way; I typically avoid nested tables. So my vote (FWIW) is for the result above to at least be an option.
hadley commentedon Jan 12, 2020
@skinner yes that's exactly what I was meaning.
hadley commentedon Jan 12, 2020
@skinner does this look as you'd expect?
Created on 2020-01-12 by the reprex package (v0.3.0)
(I think the difference from the SQL results is
NA
matchesNA
in R, butNULL
doesn't matchNULL
in SQL)skinner commentedon Jan 12, 2020
Hm, in that output, the
a
column has a 4 and a 5, which it didn't have in the input. That's not what I expected.Given that there's an
na_matches
argument, I assume that passing bothna_matches = "never"
andkeep = TRUE
would give a result that matches the SQL result?hadley commentedon Jan 12, 2020
@skinner good catch, I'll have to think about that more.
na_matches
is currently unavailable in the dev version due to some changes in the underlying implementation (we're planning on bringing it back but it'll probably be a few weeks yet)hadley commentedon Jan 13, 2020
Ok, finally got it 😄
Created on 2020-01-13 by the reprex package (v0.3.0)
Implement keep = T in full_join()
skinner commentedon Jan 14, 2020
Awesome, thank you! 🎉
ianmcook commentedon Jan 16, 2020
@hadley do you think you'll implement the
keep
argument forleft_join()
andright_join()
as well? In databases, left and right outer joins return the join key columns from both the left and right tables. (It's not just full outer joins that do this.) Thanks!ianmcook commentedon Jan 16, 2020
@hadley dbplyr includes code to make databases return results that are consistent with
keep = FALSE
. See https://github.com/tidyverse/dbplyr/blob/master/R/query-join.R#L90-L106; I originally contributed this in tidyverse/dbplyr#3. Time permitting, I'll open a PR to make dbplyr act differently whenkeep = TRUE
.hadley commentedon Jan 16, 2020
@ianmcook Oh yeah, that makes sense. Would you be interested in doing a PR?
ianmcook commentedon Jan 16, 2020
@hadley sure, I'll give it a try, following the way you implemented it in
full_join()
in 1e76e8aianmcook commentedon Jan 16, 2020
@hadley currently in
full_join()
whenkeep = TRUE
, only the join column(s) from the RHS table get the suffix (.y
) appended. I think the join column(s) from the LHS table should also get the suffix (.x
) appended—because they're not coalesced, so they're the values from the LHS table. Ok if I make this change too?ianmcook commentedon Jan 16, 2020
@hadley please take a look at #4762. Thanks!