Skip to content

[flink] Assign splits with fixed batch size in StaticFileStoreSplitEnumerator #687

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Mar 24, 2023

Conversation

wxplovecc
Copy link
Contributor

@wxplovecc wxplovecc commented Mar 22, 2023

Purpose

Assign all splits in one time will produce exceed akka.framesize to exceptions
Add scan.bounded.splits.size config option to detemine how many splits will be assign per batch

Tests

StaticFileStoreSplitEnumeratorTest.testSplitBatch

Verified

This commit was signed with the committer’s verified signature.

Verified

This commit was signed with the committer’s verified signature.
@@ -79,10 +82,15 @@ public void handleSplitRequest(int subtask, @Nullable String hostname) {
// The following batch assignment operation is for two purposes:
// To distribute splits evenly when batch reading to prevent a few tasks from reading all
// the data (for example, the current resource can only schedule part of the tasks).
// TODO: assignment is already created in constructor, here can just assign per batch
List<FileStoreSourceSplit> splits = pendingSplitAssignment.remove(subtask);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make pendingSplitAssignment to Map<Integer, Queue<FileStoreSourceSplit>>?
And poll a batch from it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -37,6 +37,9 @@
public class StaticFileStoreSplitEnumerator
implements SplitEnumerator<FileStoreSourceSplit, PendingSplitsCheckpoint> {

/** Default batch splits size to avoid exceed `akka.framesize`. */
private static final int DEFAULT_SPLITS_SIZE = 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we introduce an option for this? Can be an option in FlinkConnectorOptions?

@@ -26,6 +26,12 @@
<td>String</td>
<td>The log system used to keep changes of the table.<br /><br />Possible values:<br /><ul><li>"none": No log system, the data is written only to file store, and the streaming read will be directly read from the file store.</li></ul><ul><li>"kafka": Kafka log system, the data is double written to file store and kafka, and the streaming read will be read from kafka.</li></ul></td>
</tr>
<tr>
<td><h5>scan.bounded.splits.size</h5></td>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scan.split-enumerator.batch-size?

@@ -94,6 +98,6 @@ public SplitEnumerator<FileStoreSourceSplit, PendingSplitsCheckpoint> restoreEnu
}

Snapshot snapshot = snapshotId == null ? null : snapshotManager.snapshot(snapshotId);
return new StaticFileStoreSplitEnumerator(context, snapshot, splits);
return new StaticFileStoreSplitEnumerator(context, snapshot, splits, splitsSize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we can get splitBatchSize from table. (via table.options())

List<FileStoreSourceSplit> splits = pendingSplitAssignment.remove(subtask);
if (splits != null && splits.size() > 0) {
context.assignSplits(new SplitsAssignment<>(Collections.singletonMap(subtask, splits)));
Queue<FileStoreSourceSplit> allSubTaskSplits = pendingSplitAssignment.get(subtask);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: rename to taskSplits

if (splits != null && splits.size() > 0) {
context.assignSplits(new SplitsAssignment<>(Collections.singletonMap(subtask, splits)));
Queue<FileStoreSourceSplit> allSubTaskSplits = pendingSplitAssignment.get(subtask);
List<FileStoreSourceSplit> batchAssignment = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: rename to assignment?

Copy link
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wxplovecc , +1 for the pr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants