Description
The manual pages for the top_n
function do not include any examples with date values and trying to pick out the earliest/latest of a period can be confusing. For example, I worked in insurance so we had eligibility periods that ran from startdate to enddate.
To get the earliest startdate, as a prior SQL programmer, I would expect to use an ascending list and the top item on the list is the first one. However, top_n provides the "largest" date i.e. the last one.
The ordering of an ascending list should return as the top the first item in the list. However, top_n returns the largest value, not the smallest. This can be seen in the example below. I am also porting the data over to SQL so you can see how this ascending order of lists, limit to the first item 1 returns differently there (in many SQL variants SELECT TOP # is supported but not SQLite).
Reproducible Example:
library(tidyverse)
example <- data.frame( startdate = seq(as.Date("2019/01/01"), as.Date("2019/12/31"), by="days"),
enddate = seq(as.Date("2021/01/01"), as.Date("2021/12/31"), by="days") )
example[1:5, ]
###erroneous result
example %>%
top_n( 1, startdate)
#2019-12-31
example %>%
select( startdate ) %>%
arrange( startdate ) %>%
top_n( 1 )
#2019-12-31
###desired solution
example %>%
summarize( output = min(startdate) )
#2019-01-01
example %>%
top_n( -1, startdate )
library(DBI)
db=dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable( db, "example", example)
SQL Snippet to do the same thing
SELECT startdate
FROM example
ORDER BY startdate
LIMIT 1
*2019-01-01
The same is true of the reverse, if you are obtaining end-date you would use a descending list from oldest to newest and pull the first item, but this pulls the "smallest" i.e. the "earliest" item.
top_n( 1, enddate )
However, coming from a SQL background this is counter-intuitive where I would normally query such as this:
SELECT top 1 id, startdate
FROM x
GROUP BY id
ORDER BY startdate
SELECT top 1 id, enddate
FROM x
GROUP BY id
ORDER BY desc(enddate)
Or alternatively, and easier if not looking for a matched set...
SELECT id, min(startdate), max(enddate)
FROM x
GROUP BY id
Activity
romainfrancois commentedon Jul 17, 2019
Not sure I understand,
top_n(1, startdate)
gives you the row with the1
greateststartdate
value .same as with numbers:
cochetti commentedon Jul 17, 2019
If you don't understand why did you close the ticket? The behavior of top_n is opposite that of what it does in SQL.
cochetti commentedon Jul 17, 2019
Nowhere in the manual entry is the word "GREATEST" used.... So it is natural to interpret top_n to comply with how it is in SQL, where it gives you the top entry among an ordered list... ascending order is 1, 2, 3, 4.... so top_n gives 1 the first entry in the list, not the greatest
cochetti commentedon Jul 22, 2019
Thank you. I appreciate the reconsideration and apologize for my abrupt comment about not understanding.
cochetti commentedon Aug 4, 2019
Quick note... This documentation issue also relates to the concern raised in #1008 (issues/1008) where the implementation doesn't "play well" with explicitly using ascending or descending. The commenters also had issues as they were interpreting it to be similar to SQL, where the expected results were based on sorting of the data.
SQL processes
top n
orlimit
statements based on the ordering specified in theORDER BY
clause. When SQL programmers switch to the tidyverse, they are expectingtop_n
to behave in the same way, i.e.arrange( x ) %>% top_n(10)
will give the first 10 values among a list achieved with the prior ascending sort. Note klmr's comment on #1008 where the expectation is thatarrange( pval ) %>% head(100)
will be equivalent totop_n( 100, desc(pval) )
--- this is the same SQL-based assumption of which side is the 'top'.'Top' is confusing as it relies on a mutual understanding of which end of the list is the top after sorting ascending or descending. If you used a deck of cards as an example, if you place the cards face down on a table when you sort them, the 'top' are the last ones in the sorting, i.e. the greatest -- as
top_n
currently returns. Alternatively, after sorting if you are holding the cards in sorted order, the 'top' are the first x cards you would see, i.e. the least -- as SQLTOP N
code would return. Base R avoids this confusion as most folks usehead()
andtail()
which explicitly state which end is being used.TL;DR
dplyr::top_n
is backward to SQLTOP N
. It's confused a bunch of SQL converts.head()
andtail()
are more explicit in which "side" of the sort is being pulled from.cochetti commentedon Dec 11, 2019
I have a better example to explain this... You order a deck of cards first by suit (Club, Diamond, Heart, Spade) and then by rank/pip (A, 2 - 10, J, Q, K.... assume these are coded 1 to 13). Okay so we have Clubs 1, Clubs 2, Clubs 3 .... to Spades 11, Spades 12, Spades 13.
In SQL, the sort is issued with ORDER BY which would output a datatable starting with Clubs 1 in row 1 or on the "top" of the pile facing upwards so you can see the face of the card when placed on a table. SELECT TOP 1 SUIT, RANK FROM tblDeck ORDER BY SUIT, RANK then pulls Clubs 1.
In tidyverse, it appears that top_n( 1, SUIT, RANK ) will return Spades 13. In order to get the corresponding card top_n( -1, SUIT, RANK) would need to be used. This is why I claim they are "backward" to one another.
cochetti commentedon Dec 11, 2019
Also, with SQL the ORDER BY is preserved when the data is stored.
It is more natural for a SQL convert to desire to issue arrange( ) expecting the row 1 Club 1 to row 52 Spades 13. So in the next command top_n( 1 ) a SQL convert would expect this to return Club 1.
cochetti commentedon Dec 11, 2019
This applies with dates the same way... 1/1/1900, 1/1/1901, 1/1/1902... 1/1/2000
SELECT TOP 1 date FROM tblData ORDER BY date -> returns 1/1/1900.
top_n( tblData, 1, date) -> returns the top/highest value 1/1/2000, not the row 1 in a sorted table
[-]top_n with date values not returning appropriately[/-][+]top_n(df, n) returns bottom n rows[/+]hadley commentedon Dec 31, 2019
@cochetti thanks for the detailed analysis and explanation.
I think the other thing that's confusing is that if you do specify
df %>% arrange(x) %>% top_n(2)
, the arrange is actually ignored becausetop_n()
sorts by the last variable in the data frame. This result is confusing:Created on 2019-12-31 by the reprex package (v0.3.0)
This convinces me that the existing interface to
top_n()
is wrong, but it remains to figure out how to fix it without breaking existing code. I think the easiest way will be to retire (or maybe deprecate) the existingtop_n()
and replace it with a new function. But what should that function be called?Some brainstorming:
first_n()
andlast_n()
(and eliminate the use of negativen
); but would also requirefirst_frac()
andlast_frac()
, and four functions for this purpose seems like a lot. (Another option would be allowfrac(0.5)
to work). Similarlystart_n()
/beginning_n()
andend_n()
Something that doesn't indicate directionality:
edge()
,margin()
,periphery()
,extrema()
,fringe()
,outside()
Some minor variation on
top_n()
:top()
,top_n2()
,topn()
Of these options
top()
seems most appealing, but because it's so short the likelihood of a clash with another package is probably high.(This function would also sort by the
wt
variable, if supplied, as requested in #4660. Probably should also includewith_ties
argument to match SQL. It also wouldn't evaluaten
, but keep it fixed.)hadley commentedon Jan 1, 2020
Oooh, maybe
slice_top()
andslice_bottom()
(and thenslice_random()
could replacesample_n()
andsample_frac())
. Maybe have mutually exclusiven
andprop
arguments to avoid the two suffixes.hadley commentedon Jan 1, 2020
slice_head()
andslice_tail()
would be even better:Retire top_n()
cochetti commentedon Jan 15, 2020
I really like the solution of calling it slice_xxxx... I think it resolves the "top" terminology confusion that exists even in SQL. And the addition of slice_random is a nice complementary implementation.
FYI, I am working on a suggestion for how to annotate functions where an issue like this one is found and a solution has not yet been implemented or won't be.