Closed
Description
Hi
In my test environment services are frequently restarted by Mesos/Marathon without unregistration (SIGKILL),
so I noticed that the Eureka instance is frequently in self-preservation mode.
Eureka has 65 registered services, but numberOfRenewsPerMinThreshold is 368,
which seems to be incorrect.
I found that numberOfRenewsPerMinThreshold is decreased only via the REST API
(com.netflix.eureka.registry.PeerAwareInstanceRegistryImpl#cancel),
but when a lease expires
(com.netflix.eureka.registry.AbstractInstanceRegistry#evict(long))
numberOfRenewsPerMinThreshold does not decrease,
so numberOfRenewsPerMinThreshold does not correlate with the real instance count and expectedNumberOfRenewsPerMin.
some greps:
REGISTERED
grep "Registered instance" eureka.log| wc -l
----
286
EXPIRED
grep "Registry: expired lease for" eureka.log| wc -l
----
213
RENEWAL
grep "Current renewal" eureka.log
----
929790926 2017-10-30 10:08:11,991 INFO [ ReplicaAwareInstanceRegistry - RenewalThresholdUpdater ] c.n.e.r.PeerAwareInstanceRegistryImpl | Current renewal threshold is : 368
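A rough sanity check of those numbers (assuming the Eureka defaults of one heartbeat every 30 seconds, i.e. 2 renews per minute per instance, and a renewal-percent-threshold of 0.85; the class below is only illustrative and not part of Eureka):

public class RenewalThresholdMath {
    public static void main(String[] args) {
        int registeredInstances = 65;          // what the registry actually holds
        int renewsPerInstancePerMin = 2;       // default 30s heartbeat interval
        double renewalPercentThreshold = 0.85; // default renewal-percent-threshold

        // threshold that 65 instances should produce: ~110
        int expectedThreshold = (int) (registeredInstances * renewsPerInstancePerMin * renewalPercentThreshold);
        System.out.println("threshold expected for 65 instances: " + expectedThreshold);

        // what the reported 368 implies: the server still expects ~216 clients,
        // i.e. the evicted instances were never subtracted from the expectation
        int reportedThreshold = 368;
        double impliedClients = reportedThreshold / renewalPercentThreshold / renewsPerInstancePerMin;
        System.out.println("clients implied by threshold 368: " + Math.round(impliedClients));

        // 65 instances can only send ~130 renews/min, which stays below 368,
        // so self-preservation keeps being triggered
        System.out.println("actual renews/min: " + registeredInstances * renewsPerInstancePerMin);
    }
}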
Activity
p-zalejko commented on Nov 21, 2017
Hi,
I also ran into the same problem and have been investigating it for a while. Here is what I found:
The "Renews threshold" is decreased if a service is shutdown gracefully. Then, before going down it calls
com.netflix.eureka.resources.InstanceResource#cancelLease
endpoint. It executesPeerAwareInstanceRegistryImpl#cancel
method that updates "Renews threshold" value.If, for instance, a service was killed by SIGKILL then the "
cancelLease
" is not called. At some point the "evict
" method gets triggered and eventually it executesorg.springframework.cloud.netflix.eureka.server.InstanceRegistry#internalCancel
method. And there is an issue: it cancels the instance but does not call thecom.netflix.eureka.registry.PeerAwareInstanceRegistryImpl#cancel
method (which updates "Renews threshold" value).In my case, I tested this issue in the following way:
After repeating step 3 many times I got a big "Renews threshold" value, even greater than "Renews (last min)". If self-preservation mode is enabled this can lead to problems, because if the "Renews threshold" is never decreased, then even after launching all the services back, the renews they send might not be enough to escape from the activated "self-preservation" mode (if it has already been activated).
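For illustration, a minimal sketch of the two paths described above (heavily simplified, not the actual Eureka source; the real classes are AbstractInstanceRegistry and PeerAwareInstanceRegistryImpl, and findExpiredLeases here is a made-up placeholder):

import java.util.Collections;
import java.util.List;

class RegistrySketch {
    protected volatile int expectedNumberOfRenewsPerMin;
    protected volatile int numberOfRenewsPerMinThreshold;

    // graceful shutdown path: DELETE /eureka/apps/{appName}/{instanceId}
    // -> InstanceResource#cancelLease -> PeerAwareInstanceRegistryImpl#cancel
    public boolean cancel(String appName, String id) {
        boolean removed = internalCancel(appName, id);
        if (removed) {
            expectedNumberOfRenewsPerMin -= 2;   // 2 renews/min per client by default
            updateRenewsPerMinThreshold();       // "Renews threshold" shrinks here
        }
        return removed;
    }

    // eviction path (SIGKILL case): the eviction task calls evict(long),
    // which only goes through internalCancel and never touches the expectation
    public void evict(long additionalLeaseMs) {
        for (String expiredId : findExpiredLeases(additionalLeaseMs)) {
            internalCancel("someApp", expiredId); // lease removed, threshold untouched
        }
    }

    protected boolean internalCancel(String appName, String id) {
        // remove the lease from the registry; no threshold bookkeeping here
        return true;
    }

    private void updateRenewsPerMinThreshold() {
        numberOfRenewsPerMinThreshold = (int) (expectedNumberOfRenewsPerMin * 0.85);
    }

    private List<String> findExpiredLeases(long additionalLeaseMs) {
        return Collections.emptyList(); // placeholder for the real expiry scan
    }
}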
" is never decreased then even after launching back all services, all the renews sent by them might not be enough to escape from the activated "self-preservation" mode (if it has been already activated).holy12345 commentedon Nov 24, 2017
@p-zalejko I've been following this discussion; has this problem been solved?
p-zalejko commented on Nov 27, 2017
Hi @holy12345. I think it isn't. I tested a milestone version (2.x) and it behaves the same.
holy12345 commented on Nov 29, 2017
@p-zalejko Hi,
This will not happen if Eureka Server self-preservation is turned off; the value of "Renews threshold" is then correct.
See PeerAwareInstanceRegistryImpl#updateRenewalThreshold(): this scheduled task runs every 15 minutes by default. If self-preservation is off, the "Renews threshold" value is recalculated. If self-preservation is turned on, there is no problem as long as the first if statement is true, but I do not understand how to make this condition hold. Is apps.getRegisteredApplications() how it gets all registered client information? Do I understand it correctly? Best wishes.
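For illustration, a simplified paraphrase of the guard being discussed (a sketch only, not the actual Eureka source; count stands for the number of instances found via eurekaClient.getApplications(), and the exact form of the condition may differ):

class RenewalThresholdUpdaterSketch {
    private volatile int expectedNumberOfRenewsPerMin;
    private volatile int numberOfRenewsPerMinThreshold;

    // runs every 15 minutes by default (RenewalThresholdUpdater)
    void updateRenewalThreshold(int count, boolean selfPreservationEnabled) {
        double renewalPercentThreshold = 0.85;      // default renewal-percent-threshold
        int renewsFromCurrentInstances = count * 2; // 2 renews/min per instance by default

        // the "first if statement": only rewrite the expectation when the freshly
        // counted instances produce enough renews, or when self-preservation is off
        if (renewsFromCurrentInstances > renewalPercentThreshold * expectedNumberOfRenewsPerMin
                || !selfPreservationEnabled) {
            expectedNumberOfRenewsPerMin = renewsFromCurrentInstances;
            numberOfRenewsPerMinThreshold =
                    (int) (renewsFromCurrentInstances * renewalPercentThreshold);
        }
        // with self-preservation ON and count == 0 (an empty getApplications() result),
        // the guard never passes, so a stale, inflated threshold is never corrected
    }
}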
p-zalejko commented on Nov 29, 2017
Hi @holy12345, I think you are right:
If self-preservation is turned off (server.enableSelfPreservation: false) then we do not have a problem, because we do not have self-preservation ;) But if it is enabled I would expect PeerAwareInstanceRegistryImpl#updateRenewalThreshold() to update 'expectedNumberOfRenewsPerMin' periodically.
Unfortunately, in PeerAwareInstanceRegistryImpl#updateRenewalThreshold() I do not know why Applications apps = eurekaClient.getApplications() returns an empty list of apps in my case. As a result, count is 0, so it never gets into the if statement and never updates expectedNumberOfRenewsPerMin. It might be a different issue, but solving it could also solve the problem we are discussing.
holy12345 commented on Nov 29, 2017
@p-zalejko First of all, thank you for your reply :) You're right, count is 0 in my test environment too. What I am trying to do next is to understand why the count is zero; maybe solving that will also solve the problem you describe. If there is progress, I will let you know first, best wishes!
holy12345 commented on Nov 30, 2017
@p-zalejko Hello, do you run only one Eureka server? I think it is not a cluster. If it is not a cluster, the value actually obtained from eurekaClient.getApplications() is an empty list. Best wishes.
yolenw commented on Apr 26, 2018
I have to follow up on this one: does this mean we are not expected to turn on "self-preservation mode" for a Eureka server in non-cluster mode? Thanks. @holy12345
spencergibb commented on Jan 28, 2019
Closing this due to inactivity. Please re-open if there's more to discuss.
Jeffrey-Hassan commented on Aug 7, 2019
@spencergibb This issue seems to persist as of the latest version (1.9.12).
I came across this as well and did some digging in the source; it seems to come down to the difference between a graceful shutdown and a non-graceful shutdown (e.g. the eviction task removing the instances, rather than a call to cancel a lease).
Ultimately, the AbstractInstanceRegistry.evict method makes a call to "internalCancel" which does not modify the "expectedNumberOfClientsSendingRenews" count. Now we have the "count" of total instances from the update task, which is calculated in PeerAwareInstanceRegistryImpl.updateRenewalThreshold by cycling through all apps and then instances of each app and keeping a tally, with one less for the evicted instance, but the "expectedNumberOfClientsSendingRenews" does not account for that evicted instance.
For a graceful shutdown of a service, PeerAwareInstanceRegistryImpl.cancel seems to be called, which decrements "expectedNumberOfClientsSendingRenews" properly.
I would've expected the variable to be decremented regardless of whether it's an eviction from the timer task or a graceful shutdown. The best place seems to be before internalCancel is called in AbstractInstanceRegistry.evict. Thoughts?
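A minimal sketch of the change proposed here (simplified, with hypothetical helper names; not necessarily the patch that later shipped), decrementing the expectation in the eviction path before internalCancel is called:

import java.util.Collections;
import java.util.List;

class EvictionFixSketch {
    private final Object lock = new Object();
    protected volatile int expectedNumberOfClientsSendingRenews;
    protected volatile int numberOfRenewsPerMinThreshold;

    // simplified evict(): decrement the expectation for every expired lease
    // before cancelling it internally, as proposed above
    public void evict(long additionalLeaseMs) {
        for (String expiredInstanceId : findExpiredLeaseIds(additionalLeaseMs)) {
            synchronized (lock) {
                // mirror what the graceful-shutdown cancel() path does
                expectedNumberOfClientsSendingRenews -= 1;
                numberOfRenewsPerMinThreshold =
                        (int) (expectedNumberOfClientsSendingRenews * 2 * 0.85);
            }
            internalCancel("someApp", expiredInstanceId);
        }
    }

    protected boolean internalCancel(String appName, String id) {
        return true; // remove the lease from the registry; no bookkeeping here
    }

    private List<String> findExpiredLeaseIds(long additionalLeaseMs) {
        return Collections.emptyList(); // placeholder for the real expiry scan
    }
}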
troshko111 commented on Jan 27, 2020
@Jeffrey-Hassan this has been fixed in v1.9.17.