top_n(df, n) returns bottom n rows

The manual pages for the `top_n` function do not include any examples with date values, and picking out the earliest or latest date of a period can be confusing. For example, I worked in insurance, where eligibility periods ran from a startdate to an enddate.

To get the earliest startdate, as a former SQL programmer I would sort the list in ascending order and take the top item, which is the first (earliest) one. However, `top_n` returns the "largest" date, i.e. the last one, not the smallest.

This can be seen in the example below. I also copy the data into SQLite so you can see that sorting in ascending order and limiting to the first row returns a different result there (many SQL variants support `SELECT TOP n`, but SQLite does not).

**Reproducible Example:**
```r

library(tidyverse)

example <- data.frame(  startdate = seq(as.Date("2019/01/01"), as.Date("2019/12/31"), by="days"),
                        enddate   = seq(as.Date("2021/01/01"), as.Date("2021/12/31"), by="days") )

example[1:5, ]

###erroneous result
example %>% 
  top_n( 1, startdate)
#2019-12-31

example %>%
  select( startdate ) %>% 
  arrange( startdate ) %>% 
  top_n( 1 )
#2019-12-31

###desired solution
example %>%
  summarize( output = min(startdate) )
#2019-01-01

example %>%
  top_n( -1, startdate )
#2019-01-01

library(DBI)
db <- dbConnect(RSQLite::SQLite(), ":memory:")

dbWriteTable( db, "example", example)
```

**SQL Snippet to do the same thing**
```{sql, connection=db}

SELECT startdate
FROM example
ORDER BY startdate
LIMIT 1
-- 2019-01-01
```
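
For comparison, the closest dplyr counterpart to `ORDER BY startdate LIMIT 1` works by position rather than by value: sort, then keep the first row. The following is a minimal sketch that rebuilds the same `example` data frame so it runs on its own; the names `first_by_position` and `first_by_value` are only illustrative, and `slice_min()` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Same data as in the reproducible example above
example <- data.frame(
  startdate = seq(as.Date("2019-01-01"), as.Date("2019-12-31"), by = "days"),
  enddate   = seq(as.Date("2021-01-01"), as.Date("2021-12-31"), by = "days")
)

# Position-based: sort ascending, keep the first row (like ORDER BY ... LIMIT 1)
first_by_position <- example %>%
  arrange(startdate) %>%
  slice(1)

# Value-based: keep the row(s) with the smallest startdate
# (assumption: dplyr >= 1.0.0, where slice_min() is available)
first_by_value <- example %>%
  slice_min(startdate, n = 1)

first_by_position$startdate
first_by_value$startdate
```

Both approaches pick the 2019-01-01 row here, which is the result I expected from `top_n( 1, startdate )`.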

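The same is true in reverse. To obtain the latest enddate, I would sort in descending order (newest to oldest) and pull the first item. In `top_n` terms, the first item of that descending list is the "largest" value, so `n` must be positive; a negative `n` pulls the "smallest", i.e. the earliest, item.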
```r
example %>%
  top_n( 1, enddate )
#2021-12-31
```

Coming from a SQL background, I find this counter-intuitive. I would normally write queries such as these:

```sql
-- earliest startdate: sort ascending, keep the first row
SELECT TOP 1 id, startdate
FROM x
ORDER BY startdate;

-- latest enddate: sort descending, keep the first row
SELECT TOP 1 id, enddate
FROM x
ORDER BY enddate DESC;
```

Or alternatively, and easier if not looking for a matched set... 
```sql
SELECT id, min(startdate), max(enddate)
FROM x
GROUP BY id
```
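
A rough dplyr equivalent of that grouped `min`/`max` query could look like the sketch below. The table `x` and its contents are made up for illustration (two ids with two eligibility periods each), as are the column names `first_start` and `last_end`.

```r
library(dplyr)

# Hypothetical eligibility table: two ids, two periods each
x <- data.frame(
  id        = c(1, 1, 2, 2),
  startdate = as.Date(c("2019-01-01", "2019-07-01", "2019-03-01", "2019-09-01")),
  enddate   = as.Date(c("2019-06-30", "2019-12-31", "2019-08-31", "2020-02-29"))
)

# One row per id with the earliest startdate and the latest enddate
coverage <- x %>%
  group_by(id) %>%
  summarise(
    first_start = min(startdate),
    last_end    = max(enddate)
  )

coverage
```

As in the SQL version, the two dates in each output row may come from different input rows, so this is not a matched set.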

- `x`: a `tbl()` to filter
- `n`: number of rows to return for `top_n()`, fraction of rows to return for `top_frac()`. If `x` is grouped, this is the number (or fraction) of rows per group. Will include more rows if there are ties. If `n` is positive, selects the top rows. If negative, selects the bottom rows.
- `wt`: (Optional). The variable to use for ordering. If not specified, defaults to the last variable in the tbl.
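
To make the documented behaviour of `n` concrete, here is a small sketch of the sign, the tie handling and the per-group behaviour. The `scores` data frame and the result names are made up for illustration; "top" here means the largest values of `wt`, not the first rows of a sorted list.

```r
library(dplyr)

# Hypothetical scores with a tie for the largest value in group "a"
scores <- data.frame(
  grp   = c("a", "a", "a", "b", "b"),
  score = c(10, 30, 30, 5, 20)
)

# Positive n: rows with the largest score; both tied rows are kept
largest <- scores %>% top_n(1, score)

# Negative n: rows with the smallest score
smallest <- scores %>% top_n(-1, score)

# Grouped input: n applies within each group
largest_per_group <- scores %>%
  group_by(grp) %>%
  top_n(1, score)

nrow(largest)            # 2, because of the tie at 30
smallest$score           # 5
nrow(largest_per_group)  # 3: two tied rows in "a", one row in "b"
```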

top_n(df, n) returns bottom n rows #4494

