Skip to content

[pulsar-sql] Make partition as internal column #4888

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Aug 10, 2019

Conversation

codelipenghui
Copy link
Contributor

Fixes #4785

Motivation

  1. Stop return partition name in table list, just return the partitioned topic name in table list. This will avoid huge tables while user create large number of partition.
  2. Make partition as internal column, provide users with the ability to get which partition data in and filtration based on partition. For example:
SELECT * FROM "my-table" WHERE "__partition__" = 0;
SELECT * FROM "my-table" WHERE "__partition__" in (2,3);
SELECT * FROM "my-table" WHERE "__partition__" < 1;

Modifications

  1. Add "partition" internal column.
  2. Add domain handle for "partition".

Verifying this change

Added new unit test to verify this change

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (yes)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: ( no)
  • The rest endpoints: (no)
  • The admin cli options: (no)
  • Anything that affects deployment: (no)

Documentation

  • Does this pull request introduce a new feature? (yes)

@codelipenghui codelipenghui requested review from merlimat, jerrypeng and sijie and removed request for merlimat August 5, 2019 06:09
@codelipenghui codelipenghui self-assigned this Aug 5, 2019
@codelipenghui codelipenghui added the area/sql Pulsar SQL related features label Aug 5, 2019
@codelipenghui codelipenghui added this to the 2.5.0 milestone Aug 5, 2019
}

@Override
public Object getData(RawMessage message) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can get partition from the messageId, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RawMessage does not contains partition now and get partition by split will reduce memory usage more than storing a partition per RawMessage.

}

int actualNumSplits = Math.max(numPartitions, numSplits);
List<Integer> predicatePartitions = new ArrayList<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move line 152-181 to a function "getPredicatedPartitions" or "getFilteredPartitions"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@codelipenghui
Copy link
Contributor Author

run cpp tests
run integration tests

@sijie
Copy link
Member

sijie commented Aug 5, 2019

@codelipenghui I think the tests are failing due to this pull request.


Test Name | Duration | Age
-- | -- | --
org.apache.pulsar.sql.presto.TestPulsarSplitManager.testPartitionedTopic | 55 ms | 2
org.apache.pulsar.sql.presto.TestPulsarSplitManager.testPartitionedTopic | 36 ms | 2
org.apache.pulsar.sql.presto.TestPulsarSplitManager.testPartitionedTopic | 46 ms | 2
org.apache.pulsar.sql.presto.TestPulsarSplitManager.testPublishTimePredicatePushdownPartitionedTopic | 31 ms | 2
org.apache.pulsar.sql.presto.TestPulsarSplitManager.testPublishTimePredicatePushdownPartitionedTopic | 30 ms | 2
org.apache.pulsar.sql.presto.TestPulsarSplitManager.testPublishTimePredicatePushdownPartitionedTopic | 29 ms | 2
org.apache.pulsar.sql.presto.TestPulsarSplitManager.testTopic | 0.17 sec | 2
org.apache.pulsar.sql.presto.TestPulsarSplitManager.testTopic | 0.15 sec | 2
org.apache.pulsar.sql.presto.TestPulsarSplitManager.testTopic | 0.15 sec | 2


Error Message
java.lang.ArithmeticException: / by zero
Stacktrace
java.lang.RuntimeException: java.lang.ArithmeticException: / by zero
	at org.apache.pulsar.sql.presto.PulsarSplitManager.getSplits(PulsarSplitManager.java:132)
	at org.apache.pulsar.sql.presto.TestPulsarSplitManager.testPartitionedTopic(TestPulsarSplitManager.java:136)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:124)
	at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:54)
	at org.testng.internal.InvokeMethodRunnable.run(InvokeMethodRunnable.java:44)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArithmeticException: / by zero
	at org.apache.pulsar.sql.presto.PulsarSplitManager.getSplitsPartitionedTopic(PulsarSplitManager.java:148)
	at org.apache.pulsar.sql.presto.TestPulsarSplitManager$ResultCaptor.answer(TestPulsarSplitManager.java:69)
	at org.apache.pulsar.sql.presto.PulsarSplitManager.getSplits(PulsarSplitManager.java:127)
	... 9 more

@@ -388,4 +431,8 @@ public boolean apply(Entry entry) {
}
});
}

public PulsarAdmin getPulsarAdmin() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this?

high = range.getHigh().getValueBlock().get().getInt(0, 0);
}
while (low <= high) {
if (!predicatePartitions.contains(low)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this conditional check?

if (!range.getHigh().isLowerUnbounded() && range.getHigh().getValueBlock().isPresent()) {
high = range.getHigh().getValueBlock().get().getInt(0, 0);
}
while (low <= high) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simpler and easier to read if we just use a for loop e.g.

for (int i = low; i <= high; i++)

@@ -91,6 +92,7 @@
// are empty or not
private final long splitSize;
private long entriesProcessed = 0;
private int partition = -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For non-partitioned topics, should the partition column be NULL? Currently, for internal columns, if there is not an applicable value, the value is displayed as "NULL"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to be consistent with the messageId? If not, agree use NULL for non-partitioned topics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

personally I am in favor of using -1 to be consistent with how we use it in MessageId and it make a consistent view for both partitioned topic and non-partitioned topic.

A partitioned topic has N partitions, a non-partitioned topic has one partition with partition id "-1".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with either. @merlimat any opinion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after discussing with @karthikramasamy since he has experience with databases, it seems that "null" can have all sort of meanings. Perhaps its just better to use -1 for non-partitioned topics

Copy link
Contributor

@jerrypeng jerrypeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good! I left a few comments

@codelipenghui
Copy link
Contributor Author

@jerrypeng I have addressed your comments, please take a look.
@sijie I have fix the unit test issue, please take a look.

@codelipenghui
Copy link
Contributor Author

run cpp tests
run integration tests
run java8 tests

}
for (int i = low; i <= high; i++) {
predicatePartitions.add(i);
low++;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, fixed

@codelipenghui
Copy link
Contributor Author

run java8 tests
run cpp tests
run integration tests

@sijie sijie merged commit 5adc522 into apache:master Aug 10, 2019
@codelipenghui codelipenghui deleted the sql_partition branch August 10, 2019 01:53
@jiazhai jiazhai modified the milestones: 2.5.0, 2.4.1 Aug 28, 2019
@jiazhai
Copy link
Member

jiazhai commented Aug 28, 2019

conflict with #4890 mark as 2.4.1

jiazhai pushed a commit that referenced this pull request Aug 28, 2019
Fixes #4785

### Motivation

1. Stop return partition name in table list, just return the partitioned topic name in table list. This will avoid huge tables while user create large number of partition.
2. Make partition as internal column, provide users with the ability to get which partition data in and filtration based on partition. For example:
```
SELECT * FROM "my-table" WHERE "__partition__" = 0;
SELECT * FROM "my-table" WHERE "__partition__" in (2,3);
SELECT * FROM "my-table" WHERE "__partition__" < 1;
```
### Modifications

1. Add "__partition__" internal column.
2. Add domain handle for "__partition__".

### Verifying this change

Added new unit test to verify this change

### Does this pull request potentially affect one of the following parts:

*If `yes` was chosen, please highlight the changes*

  - Dependencies (does it add or upgrade a dependency): (no)
  - The public API: (yes)
  - The schema: (no)
  - The default values of configurations: (no)
  - The wire protocol: ( no)
  - The rest endpoints: (no)
  - The admin cli options: (no)
  - Anything that affects deployment: (no)

### Documentation

  - Does this pull request introduce a new feature? (yes)

(cherry picked from commit 5adc522)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/sql Pulsar SQL related features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[pulsar-sql] export partitioned topic name
4 participants