Description
Hi,
In my Elasticsearch clusters, both the write load and the search load are heavy. Documents in the cluster have a very large number of fields, but only some of them are searched frequently (we call these hot-search-fields). We would like searches on these fields to perform better, so that response time does not increase as the number of segments grows.
We also found that search performs much better after merging down to fewer segments, because fewer segments need to be scanned and because of Lucene's cache design (it only caches the DocIdSet for the major, i.e. largest, segments).
Lucene's segment design is currently based on a row model (or document model). I wonder whether segments could be redesigned around a field model (or field-family model), so that hot-search-fields get more CPU resources and are merged more frequently, bringing their segment count down to a very small number. If so, Elasticsearch / Lucene could achieve much better performance for queries on hot-search-fields, especially when the cluster is handling a large volume of bulk requests.
This design would require deep changes to Lucene segments, possibly including live files, refresh, merge, segment metadata, and the index buffer.
Activity
elasticmachine commented on Jun 20, 2018
Pinging @elastic/es-search-aggs
jpountz commented on Jun 20, 2018
Actually this is not true: only stored fields have a row model. Other data structures like the inverted index and doc values, which are the most used ones when it comes to running queries/aggregations, are per-field. So your suggestion is actually the way that things already work today, with the exception of stored fields, but they usually don't matter for performance.
xzhthu2018 commented on Jun 21, 2018
@jpountz I think "row model" means that a segment contains all fields (with their inverted index, doc values, and stored fields), so when ES does a segment merge, it has to merge and rebuild all the fields together rather than merging the hot-search fields first. If so, the merge speed of the hot-search fields will be slowed down.
xzhthu2018 commented on Jun 21, 2018
@jpountz Thanks for your explanation.
For example:
Segment #1 has the following fields:
id:1 (with doc values, inverted index)
hot_search_field:1 (with doc values, inverted index)
cold_search_field:1 (with doc values, inverted index)
Segment #2 has the following fields:
id:2 (with doc values, inverted index)
hot_search_field:2 (with doc values, inverted index)
cold_search_field:2 (with doc values, inverted index)
When a segment merge runs and segment #1 and segment #2 are merged, hot_search_field and cold_search_field are merged together. But the cold field does not actually need to be merged first. If we spent more CPU on merging hot_search_field, searches on hot_search_field could achieve better performance.
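The example above can be sketched as a toy model in Python. This is only an illustration, not Lucene's API: the nested-dict segment layout (field name -> term -> doc ids) and the `merge_segments` helper are hypothetical. It shows that each field's postings are combined separately, yet the merged segment only exists once every field has been processed.

```python
from collections import defaultdict

# Toy model only (hypothetical, not Lucene's real data structures):
# a segment maps field name -> term -> list of segment-local doc ids.
def merge_segments(seg_a, seg_b, doc_offset):
    """Merge two toy segments; doc ids from seg_b are shifted by doc_offset."""
    merged = defaultdict(lambda: defaultdict(list))
    for field, postings in seg_a.items():
        for term, docs in postings.items():
            merged[field][term].extend(docs)
    for field, postings in seg_b.items():
        for term, docs in postings.items():
            merged[field][term].extend(d + doc_offset for d in docs)
    # The merged segment is returned only after all fields are done.
    return {field: dict(postings) for field, postings in merged.items()}

segment_1 = {
    "id": {"1": [0]},
    "hot_search_field": {"1": [0]},
    "cold_search_field": {"1": [0]},
}
segment_2 = {
    "id": {"2": [0]},
    "hot_search_field": {"2": [0]},
    "cold_search_field": {"2": [0]},
}

# segment_1 holds one document, so segment_2's doc ids start at 1.
merged = merge_segments(segment_1, segment_2, doc_offset=1)
```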
jpountz commented on Jun 22, 2018
@xzhthu2018 In any case, we cannot publish a merge until all fields have been merged, so your idea wouldn't work. The closest thing to your needs that I can think of would be for you to have one additional index that only has the hot search fields, and to use it for searches whenever none of the cold fields are needed. I'm not sure how practical it would be, however.
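A minimal sketch of that two-index idea, in Python, under stated assumptions: the index names `docs-hot` and `docs-full`, the `HOT_FIELDS` set, and both helper functions are hypothetical and not part of any Elasticsearch API. The application would write a hot-only projection of each document to the extra index and route a search there only when it needs no cold field.

```python
# Hypothetical names, chosen for illustration only.
HOT_FIELDS = {"id", "hot_search_field"}
HOT_INDEX = "docs-hot"    # assumed index holding only the hot fields
FULL_INDEX = "docs-full"  # assumed index holding every field

def hot_projection(doc):
    """Return the copy of a document that would be written to the hot index."""
    return {name: value for name, value in doc.items() if name in HOT_FIELDS}

def choose_index(fields_needed):
    """Pick the hot index only when no cold field is needed by the search."""
    if set(fields_needed) <= HOT_FIELDS:
        return HOT_INDEX
    return FULL_INDEX

doc = {"id": 1, "hot_search_field": 1, "cold_search_field": 1}
hot_doc = hot_projection(doc)
target_hot = choose_index(["hot_search_field"])
target_full = choose_index(["hot_search_field", "cold_search_field"])
```

The trade-off is that every document is indexed twice, so this adds write load to a cluster that is already write-heavy.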