Skip to content

(improvement)[bucket] Add auto bucket implement #15250

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jan 18, 2023

Conversation

JackDrogon
Copy link
Contributor

@JackDrogon JackDrogon commented Dec 21, 2022

Proposed changes

Problem summary

用户经常设置不合适的bucket,导致各种问题,这里提供一种方式,来自动设置分桶数。暂时而言只对olap表生效

实现思路

根据数据量,计算分桶数。
对于分区表,可以根据历史分区的数据量、机器数、盘数,确定一个分桶。
主要问题是初始桶数不好确定。
这里提供两种方式:

  1. 根据机器数、盘数,确定一个分桶数
  2. 用户可以提供一个数据量的经验值,根据这个值,确定分桶数。

详细设计

  1. 建表
create table tbl1
(...)
[PARTITION BY RANGE(...)]
DISTRIBUTED BY HASH(k1) BUCKETS AUTO
properties(
    "estimate_partition_size" = "100G"
)
  • BUCKETS AUTO 表示自动设定buckets
  • estimate_partition_size:可选参数,提供一个单分区初始数据量。
  1. 分桶计算逻辑
    初始分桶计算
  • 没有给 estimate_partition_size
    这种基本上不太靠谱。使用了default_bucket_num(10)。
  • 给了 estimate_partition_size

这里我们先假设给的是单副本文本格式的数据量

  1. 先根据数据量得出一个桶数:N
    首先数据量除以5(按5比1的压缩比算)
    < 100MB : 1
    < 1G: 2

    1G: 每1G一个分桶。

  2. 根据桶数和盘数的乘机得出一个桶数 M
    每个BE节点算1
    磁盘容量,每50G算1

  3. min(M, N, 128),如果这个值小于N,也小于机器数。取机器数。

举例:

1. 先根据数据量得出一个桶数:N
    首先数据量除以5(按5比1的压缩比算)
    < 100MB : 1
    < 1G: 2
    > 1G:  每1G一个分桶。

2. 根据BE数和盘数的乘机得出一个桶数 M
    每个BE节点算1
    磁盘容量,每50G算1
    
3. min(M, N, 128),如果这个值小于N,也小于机器数。
取机器数。这里就是

举例:
1. 100MB,10台机器,2T * 3盘 = 1
数据量: 1
BE磁盘: 10 * 40 * 3 = 1200
min计算: 1
最终: 1

2. 1G, 3台机器,500GB * 2盘 = 2
数据量: 2
BE磁盘: 3 * 10 * 2 = 60
min计算: 2
最终: 2

3. 100G,3台机器,500GB * 2盘 = 20
数据量: 20
BE磁盘: 3 * 10 * 2 = 60
min计算: 20
最终: 20

4. 500G,3台机器,1T * 1盘 = 60
数据量: 100
BE磁盘: 3 * 21 * 1 = 63
min计算: 63
最终: 63

5. 500G,10台机器,2T * 3盘 = 128
数据量: 500
BE磁盘: 10 * 41 * 3 = 1230
min计算: 100
最终: 100

6. 1T,10台机器,2T * 3盘 = 128 
数据量: 200
BE磁盘: 10 * 41 * 3 = 1230
min计算: 128
最终: 128

7. 500G,1台机器,100TB * 1盘 = 128
数据量: 100
BE磁盘: 1 * 2048 * 1 = 2048
min计算: 100
最终: 100

8. 1TB, 200台机器,4T * 7盘 = 200
数据量: 205
BE磁盘: 200 * 80 * 7 = 112000
min计算: 128
最终: 200

计算未来分桶
仅针对分区表。
根据最多前7个分区的数据量的指数平均值,作为estimate_partition_size,进行评估。
需要判断历史分区的趋势:
比如前五个分区,每个都比前一个大,说明数据再增长,则此时不能求平均值,而应该取趋势值。
仅考虑递增和递减的情况。其他情况,求平均。

Checklist(Required)

  1. Does it affect the original behavior:
    • Yes
    • No
    • I don't know
  2. Has unit tests been added:
    • Yes
    • No
    • No Need
  3. Has document been added or modified:
    • Yes
    • No
    • No Need
  4. Does it need to update dependencies:
    • Yes
    • No
  5. Are there any changes that cannot be rolled back:
    • Yes (If Yes, please explain WHY)
    • No

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

Sorry, something went wrong.

@github-actions github-actions bot added the area/planner Issues or PRs related to the query planner label Dec 21, 2022
@JackDrogon JackDrogon force-pushed the feature/autobucket branch 2 times, most recently from 5ad9d10 to 62117bf Compare December 21, 2022 12:05
@hello-stephen
Copy link
Contributor

hello-stephen commented Dec 21, 2022

TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 34.96 seconds
load time: 498 seconds
storage size: 17120651505 Bytes
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230113131233_clickbench_pr_79952.html

@JackDrogon JackDrogon force-pushed the feature/autobucket branch 2 times, most recently from 5ae2d53 to e136c9d Compare December 21, 2022 20:16
@morningman morningman self-requested a review December 22, 2022 01:13
@JackDrogon
Copy link
Contributor Author

Need to add show create table with autobucket && estimate_partition_size settings

@morningman
Copy link
Contributor

Better using BUCKETS AUTO instead of BUCKETS 0.
You can return 0 internally.

Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some unit tests and regression tests for this


int buckets = 0;
for (Backend backend : backends.values()) {
if (!backend.isLoadAvailable()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why judge isLoadAvailable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If backend is not loadAvailable,it would not be treated as a machine that could take on data.

@dataroaring dataroaring changed the title [feature] Add auto bucket implement (improvement)[bucket] Add auto bucket implement Dec 24, 2022
JackDrogon added a commit to JackDrogon/doris that referenced this pull request Dec 25, 2022

Verified

This commit was signed with the committer’s verified signature.
miguelgrinberg Miguel Grinberg
JackDrogon added a commit to JackDrogon/doris that referenced this pull request Dec 25, 2022
JackDrogon added a commit to JackDrogon/doris that referenced this pull request Dec 25, 2022
JackDrogon added a commit to JackDrogon/doris that referenced this pull request Dec 27, 2022
table.getPartitions() in DynamicPartitionScheduler::getBucketsNum (apache#15250)
JackDrogon added a commit to JackDrogon/doris that referenced this pull request Jan 5, 2023
JackDrogon added a commit to JackDrogon/doris that referenced this pull request Jan 5, 2023
JackDrogon added a commit to JackDrogon/doris that referenced this pull request Jan 9, 2023
JackDrogon added a commit to JackDrogon/doris that referenced this pull request Jan 9, 2023
JackDrogon added a commit to JackDrogon/doris that referenced this pull request Jan 13, 2023
JackDrogon added a commit to JackDrogon/doris that referenced this pull request Jan 13, 2023
Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

int buckets = 0;
for (Backend backend : backends.values()) {
if (!backend.isLoadAvailable()) {
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
break;
continue;

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jan 16, 2023
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

@morningman morningman merged commit 3407536 into apache:master Jan 18, 2023
dutyu pushed a commit to dutyu/doris that referenced this pull request Feb 1, 2023
@enterwhat
Copy link

how can it support Colocation Join ?

@perid007
Copy link

how can it support Colocation Join ?

+1 same question

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by one committer. area/planner Issues or PRs related to the query planner dev/1.2.2-merged kind/test reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants