[HUDI-5671] BucketIndexPartitioner partition algorithm skew #7815
Merged
+6
−4
Sync hudi master
danny0405
approved these changes
Feb 3, 2023
+1, nice improvement.
yihua
pushed a commit
that referenced
this pull request
Feb 4, 2023
neverdizzy
added a commit
to neverdizzy/hudi
that referenced
this pull request
Feb 8, 2023
neverdizzy
added a commit
to neverdizzy/hudi
that referenced
this pull request
Feb 9, 2023
XuQianJin-Stars
pushed a commit
that referenced
this pull request
Feb 11, 2023
nsivabalan
pushed a commit
to nsivabalan/hudi
that referenced
this pull request
Mar 22, 2023
fengjian428
pushed a commit
to fengjian428/hudi
that referenced
this pull request
Apr 5, 2023
Change Logs
An online job that had been running for 13 days was found to have subtasks that processed no data, as shown in the figure below. The job uses the update time as the partition and uses the bucket index, with 128 buckets and a write parallelism of 128. The keys are evenly distributed: from the storage point of view, the file sizes of the buckets differ very little. Investigation showed that the skew comes from the shuffle algorithm.
Potential drawbacks of the algorithm skew:
Current algorithm:

Algorithm flaws:
Algorithm optimization:
kb = key % b;  kb ∈ [0, b-1]
pw = pt % w;  pw ∈ [0, w-1]
shuffleIndex = (pw + kb) % w;  shuffleIndex ∈ [0, w-1]
In other words, a pw is first computed from the partition. pw can be understood as a starting slot Wn allocated to that partition; each partition gets its own starting slot.
The data of the partition is then written to the b slots that follow from this starting slot, wrapping around at w.
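The formulas above can be sketched in a few lines of Python. This is only an illustration, not the code changed in this PR: the function names `shuffle_index` and `subtasks_for_partition` are hypothetical, and it assumes that `key` and `pt` are non-negative integer hashes of the record key and the partition path, that `b` is the number of buckets, and that `w` is the write parallelism.

```python
def shuffle_index(key_hash: int, partition_hash: int, num_buckets: int, parallelism: int) -> int:
    """Map a record to a write subtask using the formulas from the description.

    Assumption: key_hash and partition_hash are non-negative integers
    (hashes of the record key and of the partition path).
    """
    kb = key_hash % num_buckets          # bucket id, kb in [0, b-1]
    pw = partition_hash % parallelism    # starting slot of the partition, pw in [0, w-1]
    return (pw + kb) % parallelism       # shuffleIndex in [0, w-1]


def subtasks_for_partition(partition_hash: int, num_buckets: int, parallelism: int) -> list:
    """Return the sorted, distinct subtasks that one partition's buckets land on."""
    return sorted(
        {shuffle_index(kb, partition_hash, num_buckets, parallelism) for kb in range(num_buckets)}
    )


if __name__ == "__main__":
    # Setup from the description: 128 buckets, write parallelism 128.
    # The partition hash 12345 is an arbitrary value chosen for the demo.
    used = subtasks_for_partition(12345, 128, 128)
    print(len(used))  # 128: every subtask receives exactly one bucket of this partition
```

With b = w, the buckets of a single partition occupy b consecutive slots starting at pw, so they cover all w subtasks and no subtask is left idle for that partition.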
Impact
NA
Risk level (write none, low medium or high below)
NA
Documentation Update
NA
Contributor's checklist