Support in-place update for single-valued non-indexed non-stored numeric doc value based fields. #30433

ebernhardson · 2018-05-07T18:36:47Z

I have a specific use case: pushing a weekly update of a floating point value to 1B documents. The value represents the popularity of an item and is used at query time as part of the scoring calculation. Our current solution is to push bulk updates along with a script that no-ops any update smaller than some threshold, chosen as a trade-off between accuracy and the percentage of the index that is deleted and reindexed. Pushing the current value on regular document updates also helps the no-op be more effective.
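A minimal sketch of the no-op filter described above, assuming a relative-change threshold (the thread does not say whether the cutoff is absolute or relative). The function names and the 5% default are hypothetical, and in practice the check would more likely live in an update script than in client-side code.

```python
def should_apply(old, new, min_rel_change=0.05):
    """Return True if `new` differs enough from the stored value `old`.

    `min_rel_change` is a hypothetical tuning knob: larger values skip more
    updates (less reindexing) at the cost of staler popularity scores.
    """
    if old is None:          # document has no stored value yet
        return True
    if old == 0.0:           # avoid dividing by zero
        return new != 0.0
    return abs(new - old) / abs(old) >= min_rel_change


def filter_updates(stored, incoming, min_rel_change=0.05):
    """Keep only the incoming values that are worth a reindex."""
    return {
        doc_id: value
        for doc_id, value in incoming.items()
        if should_apply(stored.get(doc_id), value, min_rel_change)
    }


stored = {"a": 100.0, "b": 100.0}
incoming = {"a": 101.0, "b": 120.0, "c": 5.0}
to_send = filter_updates(stored, incoming)
```

Here `a` changes by 1% and is skipped, while `b` (20% change) and the previously unseen `c` are sent.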

Updatable doc values in Lucene seem like a potential solution, and Solr exposes them (with a variety of caveats). Could a very focused implementation offer the same ability to push single-valued floating point numbers into a document without a reindex operation?

elasticmachine · 2018-05-08T08:13:28Z

Pinging @elastic/es-core-infra

jpountz · 2018-05-09T07:34:15Z

We haven't exposed updateable doc values in Elasticsearch because they provide a trade-off that is hard to reason about. For instance, say you update a single value of a single document: the next refresh will need to rewrite doc values for the entire segment that contains this document. If this were exposed, chances are that such updateable fields would be used for things like view counters, and I wouldn't be surprised if, for some users, doc values for all segments needed to be rewritten on every refresh, which would certainly cause write performance and scalability issues. I'm not saying we shouldn't do it at all, but it would at least require careful documentation that it isn't like an in-place update. There are other options that could be considered as well, like storing this data in some side-car data structure so that it doesn't necessarily have to live by the rules of the Lucene index, such as the need to provide point-in-time snapshots.
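To make the write-amplification concern concrete, here is a toy model, not Lucene code: it assumes that updating any document forces the field's doc values to be rewritten for every document in the same segment at the next refresh. The function name and the segment sizes are hypothetical.

```python
import bisect


def values_rewritten(segment_sizes, updated_docs):
    """Count doc values rewritten at the next refresh in this toy model.

    segment_sizes: number of documents in each segment, in order.
    updated_docs:  global doc ids (0-based) whose value changed.
    """
    # First global doc id of each segment.
    starts = []
    total = 0
    for size in segment_sizes:
        starts.append(total)
        total += size

    # A segment is "touched" if it contains at least one updated doc.
    touched = {bisect.bisect_right(starts, doc) - 1 for doc in updated_docs}
    return sum(segment_sizes[i] for i in touched)


segments = [1000, 500, 100]
one_update = values_rewritten(segments, [3])
spread_updates = values_rewritten(segments, [3, 1000, 1599])
```

In this model a single update rewrites 1000 values, and three updates spread across all segments rewrite all 1600, which is the view-counter scenario described above.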

nik9000 · 2018-05-10T20:22:55Z

Thanks @jpountz for explaining so clearly why we don't expose updateable doc values! I imagine a side car that doesn't follow the Lucene visibility rules might actually work for this but might be pretty confusing as well. I'd love for us to have something here because it is an important feature for folks that are concerned with optimizing search relevance based on frequently changing signals. Which feels like a thing we should support.

jpountz · 2018-05-10T20:56:57Z

Yeah I'm unhappy that some users resort to using things like parent/child to solve this problem, by storing the frequently-changing values in small documents that they later join at search time. It introduces other issues. I wish we provided something better. Let me try to summarize the options that I am aware of:

Use the _update API

simple, but
need to reindex the entire document
higher merging activity and (slightly) slower searches because of deletes

Use updateable doc values

significant write amplification since doc values of the updated field need to be rewritten for entire segments
fast at search time

Make doc values support stacked updates, i.e. writes would only write a delta and things would be resolved on read

better write scalability
slower search

Side-car data

values could be updated in-place
wouldn't support point-in-time snapshots
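The "stacked updates" option can be sketched as follows. This is an illustrative in-memory model under the assumption that each write appends a small map of changed values and reads check the newest delta first. The class and method names are hypothetical and do not correspond to any Lucene or Elasticsearch API.

```python
class StackedDocValues:
    """Toy model of stacked doc-value updates (hypothetical, not Lucene)."""

    def __init__(self, base):
        self.base = list(base)   # one value per doc id
        self.deltas = []         # list of {doc_id: value}, oldest first

    def write(self, updates):
        # Cheap write: only the changed values are recorded.
        self.deltas.append(dict(updates))

    def read(self, doc_id):
        # Slower read: walk the stack from newest to oldest delta.
        for delta in reversed(self.deltas):
            if doc_id in delta:
                return delta[doc_id]
        return self.base[doc_id]

    def compact(self):
        # Merge-time work: fold deltas into the base, oldest first,
        # so that newer values overwrite older ones.
        for delta in self.deltas:
            for doc_id, value in delta.items():
                self.base[doc_id] = value
        self.deltas.clear()


values = StackedDocValues([1.0, 2.0, 3.0])
values.write({1: 20.0})
values.write({1: 21.0, 2: 30.0})
before_compaction = [values.read(i) for i in range(3)]
values.compact()
after_compaction = [values.read(i) for i in range(3)]
```

Read cost grows with the depth of the stack, which is why the stack has to be kept contained at merge time.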

droberts195 · 2018-05-14T09:32:04Z

X-Pack Machine Learning currently does something very similar to the workaround in the original issue description when renormalizing anomaly scores. After renormalization we bulk index the result documents where the anomaly score has changed significantly, but leave existing results untouched where the change to the anomaly score is small.

So any changes that are made as a result of this issue could benefit ML too.

jpountz · 2018-05-17T20:31:37Z

We had a discussion about this feature, here are some notes:

Use-case:

Users seem to be mostly interested in numeric fields: metrics, relevance signals, etc.
We can't assume low volumes of updates, some users seem to be interested in updating a field across most documents of the index on a regular basis for instance.

Implementation:

We expected most challenges to be on the Lucene side, but the discussion highlighted that they are more on the Elasticsearch side actually, especially to integrate with the _update API and play well with Elasticsearch's replication model. We view the latter as an important requirement.
These fields should be in the document _source from the user perspective, but because _source doesn't support updating, they shouldn't be stored in the _source stored field but only in doc values and dynamically reintroduced into the _source json document at search time. Might be related to Better storage of _source #9034 which already discusses breaking down the _source into multiple fields.
We plan to only support single-valued numerics.
The fact that updates should work with Elasticsearch's replication model probably rules out the idea of maintaining a side-car data-structure: all modifications of the content of the index must be associated with a new document (which may be hidden from users and only used for administrative purposes) with a sequence number. However, we could use the new soft update API of IndexWriter to atomically update an existing document with a doc-value update and introduce a document whose only purpose is to describe the update operation with a sequence number for replication purposes.

Open questions:

Would the current implementation of doc values scale well for the workloads of our users? Stacked updates have drawbacks too: they would introduce more complexity on the merging side to keep the number of elements in the stack contained, and would make search slower. We agreed to benchmark to see how it performs for what we think will be common use-cases such as maintaining page view counts.
How could it work with the _update API? Would it need to refresh to make sure that we have an up-to-date view of the document?
Probably more questions as we dig further...

elasticmachine · 2018-07-25T14:10:27Z

Pinging @elastic/es-distributed

owenericsson · 2019-07-19T02:39:47Z

Any update on this issue?

jpountz · 2019-07-19T08:33:18Z

No. We are keeping this idea in the back of our minds, but it is very complex to do it right, and it is not obvious whether it would actually help significantly. For instance we believe it still wouldn't be good enough to index counters.

nirmalc · 2022-12-07T19:38:46Z

Sorry for bumping an old thread. Adding three low-volume use cases for which I had to either use parent/child or do bulk updates.

Tag/untag documents in legal discovery (documents are very expensive to reindex because of their size)
Price updates (every day) for eCommerce stores
Inventory updates (every hour)

colings86 added discuss :Data Management/Indices APIs APIs to create and manage indices and templates labels May 8, 2018

mayya-sharipova added team-discuss and removed discuss labels Jun 7, 2018

ywelsch mentioned this issue Jun 20, 2018

Partial Update : for better write performance #31467

Closed

bleskes added the :Distributed/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. label Jul 25, 2018

jasontedor assigned nik9000 Jul 25, 2018

jasontedor removed the team-discuss label Jul 25, 2018

bleskes added the high hanging fruit label Jul 25, 2018

jasontedor removed the :Data Management/Indices APIs APIs to create and manage indices and templates label Jul 25, 2018

nik9000 removed their assignment Dec 5, 2018

dnhatn mentioned this issue Jun 18, 2019

[Feature Request] In-Place Updates #43348

Closed

rjernst added the Team:Distributed Meta label for distributed team label May 4, 2020
