[SPARK-22968][DStream] Throw an exception on partition revoking issue #21038
Conversation
Test build #89177 has finished for PR 21038 at commit
@jerryshao Thank you very much for this issue.
@koeninger would you please help to review, thanks!
The log in the JIRA looks like it's from a consumer rebalance, i.e. more than one driver consumer was running with the same group id. Isn't the underlying problem here that the user is creating multiple streams with the same group id, despite what the documentation says? The log even shows they copy-pasted the documentation's group id: "group use_a_separate_group_id_for_each_stream". I don't think we should silently "fix" that. As a user, I wouldn't expect app A to suddenly start processing only half of the partitions just because an entirely different app B started with the same (misconfigured) group id.
Thanks @koeninger for your comments. I think your suggestion is valid; the log here is just pasted from the JIRA, but we also got the same issue from a customer's report. In the PR description, I mentioned using two apps with the same group id to mimic this issue. But I'm not sure about the real use case from our customer; maybe in their scenario such usage is valid. So I'm wondering if we can add a configuration to control whether it should fail or just warn. Also, I think the exception/warning message should be improved to directly tell the user about the consumer rebalance issue, rather than letting Kafka throw "no current assignment for partition xxx".
I can't think of a valid reason to create a configuration to allow it. It just fundamentally doesn't make sense to run different apps with the same group id.
Trying to catch and rethrow the exception with more information might make sense.
Thanks @koeninger, then I will just improve the exception message.
Test build #89230 has finished for PR 21038 at commit
Ping @koeninger, would you please help to review again. Thanks!
Seems like that should help address the confusion. Merging to master.
Thanks @koeninger for the review.
I think the statement "This is fundamentally not correct, different apps should use different consumer group" is wrong. According to Kafka, having consumers as part of the same consumer group means using the "competing consumers" pattern, in which the messages from topic partitions are spread across the members of the group.
@SehanRathnayake Kafka is designed for at most one consumer per partition per consumer group at any given point in time, https://kafka.apache.org/documentation/#design_consumerposition |
@koeninger According to the Kafka documentation: "If all the consumer instances have the same consumer group, then the records will effectively be load-balanced over the consumer instances." This means I can have multiple consumers with the same groupId, which can help me load-balance my application and scale accordingly. I don't know why it is said to be "fundamentally wrong" to have multiple consumers with the same groupId in Spark. So how can I achieve scalability to listen to a single partition and increase the consumption rate with multiple Spark consumers? Is this a Spark design fault, or is there another way to achieve this that I am unaware of?
@SehanRathnayake Any thoughts?
Read the Kafka documentation more closely. You can't have multiple consumers from the same group consuming the same partition. If you have different consumer groups, they're going to be consuming the same records. Kafka parallelism is limited to the partition, and Spark DStream partitions are 1:1 with the Kafka partitions. If your compute cost per record is much greater than the cost of reading, you can shuffle in Spark after consuming. Otherwise your only real option is to repartition Kafka.
"Our topic is divided into a set of totally ordered partitions, each of which is consumed by exactly one consumer within each subscribing consumer group at any given time."
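The property quoted above can be illustrated with a toy simulation. This is not the real Kafka assignor (the actual assignment is negotiated by the group coordinator); it is just a sketch showing that within one consumer group each partition is owned by exactly one member, so a second app joining the same group can only take partitions away from the first:

```python
def assign_round_robin(partitions, members):
    """Toy round-robin assignment: each partition goes to exactly one member."""
    assignment = {m: [] for m in members}
    for i, p in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(p)
    return assignment

# One driver consumer owns every partition of the topic:
print(assign_round_robin(["t-0", "t-1", "t-2"], ["app-A"]))
# A second app with the SAME group id joins: partitions are split, not shared,
# so app-A silently loses partitions -- the situation this PR detects.
print(assign_round_robin(["t-0", "t-1", "t-2"], ["app-A", "app-B"]))
```

Different group ids, by contrast, would each receive a full copy of every partition's records.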
Hey @koeninger thanks for the reply. The issue with my application is that I have one topic with 3 partitions. Once I start my application (Spark consumer) it listens to all 3 partitions; the log reads:
Setting newly assigned partitions [topic.partition-2, topic.partition-1, topic.partition-0]
When I start another instance of the same application with the same group id, I can see there is a rebalance in Spark and one partition is assigned to the second application instance. The log in the first application instance reads:
Setting newly assigned partitions [topic.partition-2, topic.partition-0]
So we can see that topic.partition-1 is assigned to the second instance of the application in the rebalancing process. But just after the above-mentioned log there is an exception:
java.lang.IllegalStateException: Previously tracked partitions [topic.partition-1] been revoked by Kafka because of consumer rebalance. This is mostly due to another stream with same group id joined, please check if there're different streaming application misconfigure to use the same group id. Fundamentally different stream should use different group id
And the application exits. How can I have multiple consumers with the same groupId for different partitions? I also set the assignment strategy to RoundRobin:
kafkaParam.put("partition.assignment.strategy", "org.apache.kafka.clients.consumer.RoundRobinAssignor");
Don't start another copy of the application with the same group id. Spark is already giving as much parallelism as possible, by having consumers on the workers.
Have you read or watched the information linked from https://github.com/koeninger/kafka-exactly-once ?
@koeninger Hi, we have prepared two Spark Streaming applications with the same group id to run on different clusters for disaster recovery. The first application fails when the second application is started a few minutes later, and throws an exception as:
Why can't you use a different group id?
If the two Spark Streaming applications use different group ids, the data will be processed twice and the result (in HBase) will be wrong.
You already have to handle data being processed twice, or you're getting bad results in the event of a failure. |
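The point above is that in any at-least-once pipeline, replays happen on failure regardless of group ids, so the sink has to tolerate them. A common way is an idempotent write keyed on a deterministic id. The sketch below is hypothetical (a dict standing in for an HBase table, not the reporter's actual code), but it shows the idea:

```python
class IdempotentStore:
    """Toy stand-in for a table keyed by row key (e.g. an HBase table)."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, value):
        # put is an overwrite, not an append: replaying the same record
        # after a failure/restart produces the same final state.
        self.rows[row_key] = value

store = IdempotentStore()
# "order-1" is delivered twice (a replay), yet is counted only once:
for row_key, value in [("order-1", 10), ("order-2", 5), ("order-1", 10)]:
    store.put(row_key, value)
```

With a non-idempotent sink (e.g. incrementing a counter per record), the same replay would double-count regardless of how the consumer groups are configured.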
What changes were proposed in this pull request?
Kafka partitions can be revoked when new consumers join the consumer group and the partitions are rebalanced. But the current Spark Kafka connector code assumes there are no partition-revoking scenarios, so trying to get the latest offset from a revoked partition throws exceptions, as the JIRA mentions.
Partition revoking happens when a new consumer joins the consumer group, which means different streaming apps are trying to use the same group id. This is fundamentally not correct; different apps should use different consumer groups. So instead of letting Kafka throw a confusing exception, improve the error reporting by identifying the revoked partitions and directly throwing a meaningful exception when a partition is revoked.
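The check described above can be sketched as a set difference between the partitions the stream was tracking and the partitions the consumer still holds. This is a simplified Python sketch with hypothetical names (the real fix is in the Scala connector code), not the actual implementation:

```python
def check_no_partitions_revoked(tracked, currently_assigned):
    """Fail fast with a clear message if any tracked partition was revoked.

    tracked / currently_assigned: iterables of partition identifiers.
    """
    revoked = set(tracked) - set(currently_assigned)
    if revoked:
        # Mirrors the spirit of the PR's message: name the revoked partitions
        # and point at the likely cause (a second stream with the same group id)
        # instead of Kafka's opaque "no current assignment for partition xxx".
        raise RuntimeError(
            f"Previously tracked partitions {sorted(revoked)} were revoked by "
            "Kafka because of a consumer rebalance. This is most likely caused "
            "by another stream joining with the same group id; different "
            "streams should use different group ids.")
    return True
```

The key design choice is failing fast: a silent rebalance would leave the first app processing only a subset of partitions while appearing healthy.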
Besides, this PR also fixes bugs in DirectKafkaWordCount; this example simply cannot work without the fix.
How was this patch tested?
This was manually verified on a local cluster; unfortunately I'm not sure how to simulate it in a unit test, so I propose the PR without a UT added.