[ILM] Shrink action may allocate shards to excluded nodes #64529
Comments
Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)
By debugging the code, I found that we init a new
Line 49 in f590d4b
Line 86 in f590d4b
Can we construct a local
@gaobinlong yes, I think that is a better solution for this (recreating the deciders in the step body).
Relates to #64529.

Currently the cluster filter variables `clusterRequireFilters`, `clusterIncludeFilters` and `clusterExcludeFilters` are non-static, so the new instance of `FilterAllocationDecider` created in `SetSingleNodeAllocateStep` in ILM cannot see changes made to the `cluster.routing.allocation.exclude._x` settings. As a result, ILM gets stuck in the shrink action if an excluded node has been selected in `SetSingleNodeAllocateStep`. `AllocationRoutedStep` has the same issue.

The main changes are:

1. Create `AllocationDeciders` in the main method of `SetSingleNodeAllocateStep` and `AllocationRoutedStep`, with the `FilterAllocationDecider` constructed from the cluster settings in the cluster metadata, so the cluster-level filters are visible when the steps execute.
2. Add some tests for the change.
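To make the stale-filter problem concrete, here is a minimal toy model in plain Java. It is not Elasticsearch code: `ToyClusterSettings` and `SnapshotFilterDecider` are hypothetical stand-ins, and the sketch assumes only that a decider fixes its view of the exclude filter when it is constructed. Under that assumption, a decider built before the setting changes keeps allowing the excluded host, while one built inside the step from the current settings does not.

```java
import java.util.HashSet;
import java.util.Set;

public class StaleFilterSketch {

    /** Hypothetical stand-in for cluster-level settings: only the excluded hosts. */
    static final class ToyClusterSettings {
        private final Set<String> excludedHosts = new HashSet<>();

        void exclude(String host) {
            excludedHosts.add(host);
        }

        /** Returns an immutable copy, so callers hold a snapshot. */
        Set<String> excludedHosts() {
            return Set.copyOf(excludedHosts);
        }
    }

    /** Hypothetical decider that captures the exclude filter once, at construction. */
    static final class SnapshotFilterDecider {
        private final Set<String> excluded;

        SnapshotFilterDecider(ToyClusterSettings settings) {
            this.excluded = settings.excludedHosts();
        }

        boolean canAllocate(String host) {
            return !excluded.contains(host);
        }
    }

    public static void main(String[] args) {
        ToyClusterSettings settings = new ToyClusterSettings();

        // Built before the exclude setting is applied: its view never updates.
        SnapshotFilterDecider builtEarly = new SnapshotFilterDecider(settings);

        settings.exclude("node2.dev");

        // Built "in the step body" from the current settings.
        SnapshotFilterDecider builtInStep = new SnapshotFilterDecider(settings);

        System.out.println("early decider allows node2.dev: " + builtEarly.canAllocate("node2.dev"));   // true
        System.out.println("step decider allows node2.dev:  " + builtInStep.canAllocate("node2.dev"));  // false
    }
}
```

The design point is only about when the filter is read: constructing the decider at the moment the step runs trades a little allocation work for a view that matches the current cluster metadata.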
This was resolved by @gaobinlong in #65037, so I'm closing this for now.
Elasticsearch version (`bin/elasticsearch --version`): 7.10.0 (and prior at least to 7.8.0)

JVM version (`java -version`):
openjdk version "12.0.2" 2019-07-16
OpenJDK Runtime Environment (build 12.0.2+10)
OpenJDK 64-Bit Server VM (build 12.0.2+10, mixed mode, sharing)

OS version (`uname -a` if on a Unix-like system):
Darwin 19.6.0 Darwin Kernel Version 19.6.0: Thu Jun 18 20:49:00 PDT 2020; root:xnu-6153.141.1~1/RELEASE_X86_64 x86_64
Description of the problem including expected versus actual behavior:

Given the following two configurations:

1. `cluster.routing.allocation.exclude._host: [ node2.dev ]`
2. `MyPolicy`

Shards belonging to indices managed with `MyPolicy` may still be assigned to nodes that are excluded from allocation at the cluster level. The problem appears to be specific to the `SetSingleNodeAllocateStep` of ILM when performing the shrink action. This step sets the index setting `settings.index.routing.allocation.require._id` to the id of a disallowed node, after which ILM is no longer able to perform the rest of the shrink action.

Steps to reproduce:
Start two nodes with:
bin/elasticsearch -Enetwork.host=node1.dev -Ehttp.port=9221 -Epath.data=dir1/data -Epath.logs=dir1/logs
bin/elasticsearch -Enetwork.host=node2.dev -Ehttp.port=9222 -Epath.data=dir2/data -Epath.logs=dir2/logs
Set up a cluster and do the following:
`TestPolicy` (notice warm has `min_age: 1s` for testing)
Policy JSON
Template JSON
NOTES
`exclude._hosts` setting
Provide logs (if relevant):
Last few logs after index allocated to a disallowed node:
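The failure mode in the problem description (an index-level `require._id` pointing at a node that a cluster-level `exclude._host` forbids) can be illustrated with a small sketch. This is a simplified model, not the real allocation logic: the `Node` class, the node ids `id-1` and `id-2`, and `eligibleNodeIds` are hypothetical names introduced for illustration. It assumes only that a shard can go to a node when both filters accept it.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ConflictingFiltersSketch {

    /** Hypothetical node description: an id plus the host it runs on. */
    static final class Node {
        final String id;
        final String host;

        Node(String id, String host) {
            this.id = id;
            this.host = host;
        }
    }

    /**
     * Returns ids of nodes that pass both the cluster-level host exclusion
     * and the index-level required node id (null means no requirement).
     */
    static List<String> eligibleNodeIds(List<Node> nodes, Set<String> excludedHosts, String requiredId) {
        return nodes.stream()
            .filter(n -> !excludedHosts.contains(n.host))
            .filter(n -> requiredId == null || requiredId.equals(n.id))
            .map(n -> n.id)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(new Node("id-1", "node1.dev"), new Node("id-2", "node2.dev"));
        Set<String> excluded = Set.of("node2.dev");

        // The step picked the excluded node: no node satisfies both filters.
        System.out.println(eligibleNodeIds(nodes, excluded, "id-2")); // []

        // Had the step picked the allowed node, allocation could proceed.
        System.out.println(eligibleNodeIds(nodes, excluded, "id-1")); // [id-1]
    }
}
```

An empty result corresponds to the stuck state reported here: the two filters contradict each other, so no node can ever satisfy them.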