Skip to content

Do not compute hit counts by default #33028

Closed
@jpountz

Description

@jpountz
Contributor

Context
Lucene 8 introduces optimizations that allow to compute top hits more efficiently by skipping documents that do not produce competitive scores. We would like to enable this behavior by default so that users can opt in if they need accurate total hit counts, which are costly, rather than the other way around.

Not returning a hit count at all is problematic for traditional search UIs with pagination: say that you want to display up to 5 pages with 10 hits per page, you need to know whether the hit count is between 0 and 10, between 11 and 20, between 21 and 30, between 31 and 40, or greater than 40 in order to know how many pages need to be displayed. In order to address this issue, Lucene takes a configurable threshold: if the hit count is less than this threshold, then you will get an accurate hit count, otherwise you will get a lower bound of the hit count.

We don't want to discuss backward compatibility for now, let's focus what we want to have eventually, and only then discuss how we get there to make the change easy to digest for users. That's fine if we need multiple steps and the whole change is only available in 8 rather than 7.

Response format
We need a way to tell users whether the hit count is accurate or a lower bound. Multiple ideas have been mentioned:

  1. not modify the response format: if the user asked to count up to X and the hit count is greater than X, just return X as the hit count
  2. use a string that ends with a + as a way to say that the hit count is a lower bound, eg. { "hits": { "total": "1234+" } } when the hit count is a lower bound and { "hits": { "total": 1234 } } like today otherwise
  3. use another field that tells whether the hit count is accurate or a lower bound, eg. { "hits": { "total": 1234, "total_hits_relation": "gte" } }
  4. make hits.total an object with a value and a relation, eg. { "hits": { "total": { "value": 1234, "relation": "gte" } } }
  5. make hits.total an object that has two possible keys but only one is ever set, eg. { "hits": { "total": { "gte": 1234 } } } or { "hits": { "total": { "eq": 1234 } } }
  6. don't reuse hits.total at all and return a different field if ie. hits.min or some better name.

When we discussed these options, we felt like 1 would make parsing more complicated, and with 2 it would be too easy to miss the fact that you need to look up another field in order to know how to interpret the hit count.

Implementation options

Option 1: Make track_total_hits take a number

We already have a track_total_hits switch which we added for index sorting, but currently take a boolean. It would be easy to make it take a number instead that would be the minimum number of hits to count accurately.

We could ease transition to the new response format by using the current format when track_total_hits is unset or set to a boolean, and the new format when track_total_hits is set to a number, and then deprecate support for booleans.

Users who want accurate hit counts could set a very high value for this parameter, we could potentially allow using a special value like -1 as a way to mean "be accurate".

Option 2: Hardcoded number of hits to count

Always count accurately up to index.max_result_window hits. If users need accurate hits, they will need to use an aggregation (we need to add such an aggregation that counts docs).

Ping @elastic/es-clients to get opinions about the above thoughts, especially the response format.

Activity

elasticmachine

elasticmachine commented on Aug 21, 2018

@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

jimczi

jimczi commented on Aug 23, 2018

@jimczi
Contributor

Another question that popped up is whether we should return "precise" lower bounds or just indicate the lower lower bound. If track_total_hits is set to 100 for instance, returning 123+ doesn't bring much more than returning 100+. The lower bound computed on each shard is not an approximation so any result above 100 can be misleading, a simple term query that can skip a lot of documents may return a small lower bound even if this term is present in all documents.

s1monw

s1monw commented on Aug 23, 2018

@s1monw
Contributor

I don't like changing the meaning of total or make it a string. We already return "total" : -1 if track_total_hits is false. We can add an additional parameter at_least to inform the user how many hits we can have at least. We can also add this easily to 6.x. From my perspective this would only guarantee best effort. On the request end I am in favor of option 1.

polyfractal

polyfractal commented on Aug 23, 2018

@polyfractal
Contributor

I'm also not a fan of the options where the meaning of total is context-dependent. I'd rather have a new field. In both cases you need to conditionally check two things, but at least this way incorrect assumptions are quickly noticed (total: -1) and it doesn't change the meaning of total.

It's also a bit more graceful upgrade path. UIs might be showing a bad value (-1) but it doesn't break code outright and users can implement the check for at_least/min/whatever when they get a chance.

Mpdreamz

Mpdreamz commented on Aug 23, 2018

@Mpdreamz
Member

Also prefer an additional property, total can already be skewed in the case of partial results (if allow_partial_search_results is not true (default false)) which throws people. A new property on the response is more explicit.

On the request side can we create a new property e.g track_total_hits_upto=NUMBER or total_accuracy=NUMBER and deprecatetrack_total_hits=BOOL instead of reusing track_total_hits ?

jpountz

jpountz commented on Sep 25, 2018

@jpountz
ContributorAuthor

We had a long discussion about this issue, here are some arguments that have been made:

  • we already have a big breaking change coming: the removal of types, maybe we should try to not introduce another major breaking change at the same time
  • it's too late for breaking changes in 7.0, anything that would require clients to change behavior by default would have to wait for 8.0 anyway
  • we want to enable this optimization by default eventually, because users should opt in to get more info at the expense of more CPU usage
  • option 3 is similar to 4 but is easier to parse and feels better
  • in general option 1 didn't get much support
  • option 2 raised concerns around the fact that one needs to check the value of another field (out of many) to know how to interpret the value of hits.total
  • option 0 and 2 have the benefit of not being breaking for apps that don't check hit counts
  • often apps check the hit count to know whether they need to display something, so we need to be careful with not breaking apps that work like that

Even though we couldn't make a decision, we narrowed down to two main paths:

A. Keep same format

  • In 7.0 we make track_total_hits accept numbers in addition to boolean, which acts as a threshold for the number of hits to count. If hits.total is greater than or equal to this number then its value is a lower bound of the actual hit count. hits.total is still -1 if track_total_hits is set to false like in 6.x.
  • In 7.0 we issue deprecation warnings when not setting track_total_hits
  • In 8.0 we change the default of track_total_hits to false.

One modified version of this plan that has been discussed is to not return hits.total at all when track_total_hits is set to false in 8.0, to be consistent with eg. aggs which don't have any value when not requested, and not confuse users starting with Elasticsearch that something may be broken if hits.total is not a number.

This path has the downside of making the transition easier but the response is less explicit: you need to know the threshold that was set in the request in order to know how to interpret hits.total in the response.

B. Make hits.total an object

  • In 7.0 we make track_total_hits accept numbers in addition to boolean, which acts as a threshold for the number of hits to count. If set to a number then hits.total is an object (it remains an integer by default for bw compat) that looks like { "hits": { "total": { "value": 1234, "relation": "gte" } } }
  • In version 7.0 we issue deprecation warnings when track_total_hits is not set to a number to force users to opt in for the new response format.
  • In version 8.0 we can change the default value to potentially 1000 like Lucene or 0 to disable the tracking of hit counts entirely

This path requires more changes on client side and makes it harder to perform generic processing of search responses due to the fact that the format of hits.total will depend on the request in 7.x. Yet it makes responses more self-contained.

stacey-gammon

stacey-gammon commented on Oct 22, 2018

@stacey-gammon

Just want to confirm my understanding of the latest plan:

Kibana will not have to do anything for 7.0 and everything will continue to work the same except we may see deprecation warnings.

If that changes (e.g. the decision to switch the default in 7.0 changes), or is incorrect, please let me know!

jimczi

jimczi commented on Oct 22, 2018

@jimczi
Contributor

Sorry I thought that I commented in this issue but for some reasons my comment is not there anymore.
So the current status after the internal discussion we had is that we'll disable track_total_hits by default in 7.0 and make hits.total and object with total (an integer that contains the number of hits) and relation, a string that defines how the total should be interpreted. When track_total_hits is false (default behavior) the total is filled with -1. We also discussed a way to get a response with the old format and decided that the client could send the major version in the request (as an header for instance), 7 would mean new response format and would default to "track_total_hits":false whereas 6 would behave like a 6x cluster (track_total_hits:true and hits.total returned as a simple integer). For Kibana this means that you'll have to explicitly set track_total_hits to true or use this new client versioning. The former can be tested now since the track_total_hits option is already available in 6x.

38 remaining items

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @karmi@camerondavison@colings86@Mpdreamz@jpountz

        Issue actions

          Do not compute hit counts by default · Issue #33028 · elastic/elasticsearch