[SPARK-22968][DStream] Throw an exception on partition revoking issue #21038
Conversation
Test build #89177 has finished for PR 21038 at commit
@jerryshao Thank you very much for this issue.
@koeninger would you please help to review, thanks!
The log in the JIRA looks like it's from a consumer rebalance, i.e. more than one driver consumer was running with the same group id. Isn't the underlying problem here that the user is creating multiple streams with the same group id, despite what the documentation says? The log even shows they copy-pasted the documentation's group id: "group use_a_separate_group_id_for_each_stream". I don't think we should silently "fix" that. As a user, I wouldn't expect app A to suddenly start processing only half of the partitions just because an entirely different app B started with the same (misconfigured) group id.
Thanks @koeninger for your comments. I think your suggestion is valid; the log here is just pasted from the JIRA, but we also got the same issue from a customer's report. In the PR description, I mentioned using two apps with the same group id to mimic this issue. But I'm not sure about the real use case from our customer; maybe in their scenario such usage is valid. So I'm wondering if we can add a configuration to control whether it should fail or just warn. Also, I think the exception/warning message should be improved to directly tell the user about the consumer rebalance issue, rather than letting Kafka throw "no current assignment for partition xxx".
I can't think of a valid reason to create a configuration to allow it. It just fundamentally doesn't make sense to run different apps with the same group id.
Trying to catch and rethrow the exception with more information might make sense.
Thanks @koeninger, then I will just improve the exception message.
Test build #89230 has finished for PR 21038 at commit
Ping @koeninger, would you please help to review again. Thanks!
Seems like that should help address the confusion. Merging to master.
Thanks @koeninger for the review.
I think the statement "This is fundamentally not correct, different apps should use different consumer group" is wrong. According to Kafka, having consumers as part of the same consumer group means using the "competing consumers" pattern, in which the messages from topic partitions are spread across the members of the group.
@SehanRathnayake Kafka is designed for at most one consumer per partition per consumer group at any given point in time, https://kafka.apache.org/documentation/#design_consumerposition |
@koeninger According to the Kafka documentation: "If all the consumer instances have the same consumer group, then the records will effectively be load-balanced over the consumer instances." This means I can have multiple consumers with the same groupId, which can help me load-balance my application and scale accordingly. I don't know why it is said to be "fundamentally wrong" to have multiple consumers with the same groupId in Spark. So how can I achieve scalability to listen to a single partition and increase the consumption rate with multiple Spark consumers? Is this a Spark design fault, or is there another way to achieve this that I am unaware of?
@SehanRathnayake Any thoughts?
Read the Kafka documentation more closely. You can't have multiple consumers from the same group consuming the same partition. If you have different consumer groups, they're going to be consuming the same records. Kafka parallelism is limited to the partition, and Spark DStream partitions are 1:1 with the Kafka partitions. If your compute cost per record is much greater than the cost of reading, you can shuffle in Spark after consuming. Otherwise your only real option is to repartition Kafka.
"Our topic is divided into a set of totally ordered partitions, each of which is consumed by exactly one consumer within each subscribing consumer group at any given time."
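The property quoted above can be illustrated with a toy simulation. This is not the real Kafka assignor (the actual assignment is negotiated by the group coordinator); it is just a sketch showing that within one consumer group each partition is owned by exactly one member, so a second app joining the same group can only take partitions away from the first:

```python
def assign_round_robin(partitions, members):
    """Toy round-robin assignment: each partition goes to exactly one member."""
    assignment = {m: [] for m in members}
    for i, p in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(p)
    return assignment

# One driver consumer owns every partition of the topic:
print(assign_round_robin(["t-0", "t-1", "t-2"], ["app-A"]))
# A second app with the SAME group id joins: partitions are split, not shared,
# so app-A silently loses partitions -- the situation this PR detects.
print(assign_round_robin(["t-0", "t-1", "t-2"], ["app-A", "app-B"]))
```

Different group ids, by contrast, would each receive a full copy of every partition's records.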
Hey @koeninger thanks for the reply. The issue with my application is that I have one topic with 3 partitions. Once I start my application (Spark consumer) it listens to all 3 partitions; the log reads:
Setting newly assigned partitions [topic.partition-2, topic.partition-1, topic.partition-0]
When I start another instance of the same application with the same group id, I can see there is a rebalance in Spark and one partition is assigned to the second application instance. The log in the first application instance reads:
Setting newly assigned partitions [topic.partition-2, topic.partition-0]
So we can see that topic.partition-1 is assigned to the second instance of the application in the rebalancing process. But just after the above-mentioned log there is an exception:
java.lang.IllegalStateException: Previously tracked partitions [topic.partition-1] been revoked by Kafka because of consumer rebalance. This is mostly due to another stream with same group id joined, please check if there're different streaming application misconfigure to use the same group id. Fundamentally different stream should use different group id
And the application exits. How can I have multiple consumers with the same groupId for different partitions? I also set the assignment strategy to RoundRobin:
kafkaParam.put("partition.assignment.strategy", "org.apache.kafka.clients.consumer.RoundRobinAssignor");
Don't start another copy of the application with the same group id. Spark is already giving as much parallelism as possible, by having consumers on the workers.
Have you read or watched the information linked from https://github.com/koeninger/kafka-exactly-once ?
@koeninger Hi, we have prepared two Spark Streaming applications with the same group id to run on different clusters for disaster recovery. The first application fails when the second application is started a few minutes later, and throws an exception as:
Why can't you use a different group id?
If the two Spark Streaming applications use different group ids, the data will be processed twice and the result (in HBase) will be wrong.
You already have to handle data being processed twice, or you're getting bad results in the event of a failure. |
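The point above is that in any at-least-once pipeline, replays happen on failure regardless of group ids, so the sink has to tolerate them. A common way is an idempotent write keyed on a deterministic id. The sketch below is hypothetical (a dict standing in for an HBase table, not the reporter's actual code), but it shows the idea:

```python
class IdempotentStore:
    """Toy stand-in for a table keyed by row key (e.g. an HBase table)."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, value):
        # put is an overwrite, not an append: replaying the same record
        # after a failure/restart produces the same final state.
        self.rows[row_key] = value

store = IdempotentStore()
# "order-1" is delivered twice (a replay), yet is counted only once:
for row_key, value in [("order-1", 10), ("order-2", 5), ("order-1", 10)]:
    store.put(row_key, value)
```

With a non-idempotent sink (e.g. incrementing a counter per record), the same replay would double-count regardless of how the consumer groups are configured.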
What changes were proposed in this pull request?
Kafka partitions can be revoked when new consumers join the consumer group and the partitions are rebalanced. But the current Spark Kafka connector code assumes there are no partition-revoking scenarios, so trying to get the latest offset from a revoked partition throws exceptions, as the JIRA mentions.
Partition revoking happens when a new consumer joins the consumer group, which means different streaming apps are trying to use the same group id. This is fundamentally not correct; different apps should use different consumer groups. So instead of letting Kafka throw a confusing exception, improve the error reporting by identifying the revoked partitions and directly throwing a meaningful exception when a partition is revoked.
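The check described above can be sketched as a set difference between the partitions the stream was tracking and the partitions the consumer still holds. This is a simplified Python sketch with hypothetical names (the real fix is in the Scala connector code), not the actual implementation:

```python
def check_no_partitions_revoked(tracked, currently_assigned):
    """Fail fast with a clear message if any tracked partition was revoked.

    tracked / currently_assigned: iterables of partition identifiers.
    """
    revoked = set(tracked) - set(currently_assigned)
    if revoked:
        # Mirrors the spirit of the PR's message: name the revoked partitions
        # and point at the likely cause (a second stream with the same group id)
        # instead of Kafka's opaque "no current assignment for partition xxx".
        raise RuntimeError(
            f"Previously tracked partitions {sorted(revoked)} were revoked by "
            "Kafka because of a consumer rebalance. This is most likely caused "
            "by another stream joining with the same group id; different "
            "streams should use different group ids.")
    return True
```

The key design choice is failing fast: a silent rebalance would leave the first app processing only a subset of partitions while appearing healthy.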
Besides, this PR also fixes bugs in DirectKafkaWordCount; this example simply cannot work without the fix.
How was this patch tested?
This was manually verified on a local cluster; unfortunately I'm not sure how to simulate it in a unit test, so I propose the PR without a UT added.