[Improvement] historical fast restart by lazy load columns metadata(20X faster) #6988
Conversation
Hi @pzhdfy, have you had a chance to test this?
Interesting idea, I too have felt the pain of multi-day rollouts for tiers of densely loaded historical servers. I've only scanned the changes so far; I'll try to do a full review later. In the meantime, if possible, could you run query benchmarks before and after this patch? I would also be interested in the performance cost for queries that do have to eat the deserialization when lazy loading is enabled.
IMO a lazy loading option is nice for very dense historical nodes, so I am supportive of this idea. It should be off by default, though, since it defers work from startup to query time, and for a medium/low-density historical node you'd prefer to do that work at startup to keep queries fast.
historical size: 100k segments and 10TB

1. without this patch
   1) druid.segmentCache.numBootstrapThreads = 1
2. with this patch and lazyLoadOnStart = false
3. with this patch and lazyLoadOnStart = true
@pzhdfy I did a similar lazy-load thing in my code base, but I found there is a consequence. If the historical node is force-killed (kill -9) while it is downloading a new segment, the unzip will very likely fail and leave a corrupted segment folder. When lazy load=false, this segment is ignored during historical startup (loading will fail), but when lazy load=true, the corruption only becomes known when a query comes in. Currently there is no interface to tell SegmentLoaderLocalCacheManager to unload or reload this segment, so queries always fail. The only workaround today is to delete the segment folder and restart the historical node so the segment is ignored (because the folder is gone). It would be better to introduce an unload/reload interface together with this PR.
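One possible mitigation, sketched below, is a cheap completeness check per segment directory before it is registered for lazy loading, so an obviously truncated unzip is skipped even when full deserialization is deferred. This is not part of this PR; it assumes the standard v9 segment layout (version.bin, meta.smoosh, 00000.smoosh), and the class and method names are made up for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical startup-time sanity check; not code from this PR.
final class SegmentDirChecks
{
  // Returns true only if the segment directory contains the expected v9
  // files with non-zero size. A killed or interrupted unzip typically
  // leaves some of these missing or truncated.
  static boolean looksComplete(Path segmentDir) throws IOException
  {
    for (String required : new String[]{"version.bin", "meta.smoosh", "00000.smoosh"}) {
      Path f = segmentDir.resolve(required);
      if (!Files.isRegularFile(f) || Files.size(f) == 0) {
        return false;
      }
    }
    return true;
  }

  private SegmentDirChecks() {}
}
```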
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
Commenting, as I'm not sure whether we want this to go stale or not.
@bputt thanks for commenting! I think this PR is worthwhile. Would you please fix the conflicts?
Thanks, the feature sounds useful.
I think that in your case the historicals don't have enough memory to keep all segments in the page cache; otherwise, on process restart, not much disk activity would be expected because the segment data would already be in the page cache. If you do have enough memory, then the behavior described in the PR description looks suspicious and maybe something else is going on.
processing/src/main/java/org/apache/druid/segment/SimpleQueryableIndex.java
server/src/main/java/org/apache/druid/segment/loading/SegmentLoaderConfig.java
Recently I wanted to apply this PR to 0.12.1, but there were many conflicts.
I want to know which commit the historical_fast_restart branch was forked from.
@JackyYangPassion
@pzhdfy yes! I have manually applied this PR to 0.12.1;
apply this pr to 0.12.1
I applied it to 0.12.3, but it was inoperative. I just applied it and restarted one historical node of the cluster.
Just on the historical.
@pzhdfy does this change push the work to query time only for the first time a segment is queried? Or is this metadata loading pushed to query time for every query run against segments, instead of only at historical startup?
It just pushes the work to query time for the first time a segment is queried.
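For readers skimming the thread, the general shape of that behavior is a memoized supplier per column, so the expensive deserialization runs at most once, on first access. This is only an illustrative sketch (the class names and deserializeColumn are placeholders, not the PR's actual code), using Guava's Suppliers.memoize, which Druid already depends on:

```java
import com.google.common.base.Supplier;
import com.google.common.base.Suppliers;

// Illustrative sketch only: "ColumnMetadata" and "deserializeColumn" are
// placeholders, not classes from this PR or from Druid.
class LazyColumnHolder
{
  private final Supplier<ColumnMetadata> metadata;

  LazyColumnHolder(String columnName)
  {
    // Suppliers.memoize defers the expensive read until get() is first
    // called, then caches the result for every later query.
    this.metadata = Suppliers.memoize(() -> deserializeColumn(columnName));
  }

  ColumnMetadata getMetadata()
  {
    // The first call pays the disk I/O; subsequent calls return the cached value.
    return metadata.get();
  }

  private static ColumnMetadata deserializeColumn(String columnName)
  {
    // Stand-in for the random reads that dominate startup on HDDs.
    return new ColumnMetadata(columnName);
  }
}

class ColumnMetadata
{
  final String name;

  ColumnMetadata(String name)
  {
    this.name = name;
  }
}
```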
@clintropolis @gianm @pzhdfy We love this patch. It makes our lives much easier when rolling out changes and running through upgrades.
I am nervous about this, though. As of now we have accepted the risk and are willing to intervene when needed. Do you have any plans to address this comment in this PR, or would that be an additional PR? I do think it would get more support if we analyzed this and solved it if it is indeed a problem.
Regarding Clint's comment on performance: we have been running it for nearly a week on a cluster with hundreds of thousands of segments and hundreds of thousands of queries a day, and our metrics collection shows a negligible change in performance (but this is vs. Druid 11; if Druid 15 had large performance gains across the board before this patch, there could be a bigger change). We added this as part of our upgrade to Druid 15, so we have not seen performance with it off in Druid 15, just in Druid 11. Regardless, we are eager to get this reviewed and accepted upstream. Our experience with it so far has been great and we'd love to help in whatever way possible to get it merged.
Force-pushed from ec4f0d4 to 7d11ca2.
Throwing another comment at this to make sure it doesn't get marked as stale. We continue to run in production with no apparent issues so far (cluster with hundreds of thousands of segments). It allowed us to complete a rolling prod upgrade of 70+ historical nodes in a single day, which was not possible for us before. If there is anything that @mohammadjkhan or I can do to help get this moving forward again, let us know!
Hi @pzhdfy, my apologies for totally forgetting about this PR. Could you please fix the conflicts and address the comments @himanshug had, at least about adding this to the documentation (I think here would be most appropriate), so we can try to get this merged? Overall lgtm 👍
OK, I will fix the conflicts and add documentation later.
@clintropolis done
I did some testing with the latest version of this PR and everything is still working for me. I also noticed that segment metadata queries performed by the broker to collect schema information for Druid SQL will likely do a fair bit of the work of loading the segments after the historical initializes, by lucky side effect. Though I suspect that will not cover all replicas, it should definitely lessen the impact of using lazy loading, I think.
Force-pushed from 70b66e2 to e37ac12.
lgtm, thanks 👍
@pzhdfy thanks for the updates, I should be able to review it early next week. However, please don't force-push changes once a PR review is started, or else it is difficult for me, as the reviewer, to see what changed since I last reviewed.
great!!!
Hi there. I have made PR #10688, which may solve the small shortcomings of lazily restarting a historical node. Please let me know if you have any questions!
We have a large amount of data in Druid; a historical (12 * 2T SATA HDD) holds over 100k segments and 10TB of data.
When we want to restart a historical to change configuration or after an OOM, it takes 40 minutes, which is too slow.
We profiled the restart process and made a flame graph.

We can see that io.druid.segment.IndexIO$V9IndexLoader.deserializeColumn costs the most time.
This is because an HDD delivers only about 100 IOPS, and a segment often has over 100 columns, so loading generates too much random I/O and disk utilization reaches 100%.
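A rough back-of-the-envelope using the numbers above (our estimate, not from the original description): ~100k segments × ~100 columns is on the order of 10 million column-metadata reads. If most of them miss the page cache, 12 HDDs at roughly 100 random IOPS each give about 1,200 IOPS, i.e. thousands of seconds of seek-bound work in the worst case, which is consistent with restarts dominated by deserializeColumn and measured in tens of minutes.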
So we make column metadata lazy-loaded, deferring deserialization until a column is first used.
After this optimization, a historical restart takes only 2 minutes (20X faster).
The flame graph after the optimization is below; we can see that loading metadata now takes little time.

We add a new config, druid.segmentCache.lazyLoadOnStart (default: false), to control whether this optimization is applied.
We suggest enabling it on HDD historicals; SSD historicals are fast enough without it.
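For reference, this is roughly how the option might be set in a historical's runtime.properties. Only druid.segmentCache.lazyLoadOnStart and its default come from this PR; the other property and its values are illustrative, not taken from the source:

```properties
# Illustrative historical runtime.properties fragment.
# The cache location below is an example value, not from this PR.
druid.segmentCache.locations=[{"path":"/mnt/druid/segment-cache","maxSize":10000000000000}]

# Defer column-metadata deserialization until a segment is first queried.
# Suggested only for dense, HDD-backed historicals; leave false (the default) elsewhere.
druid.segmentCache.lazyLoadOnStart=true
```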