Closed
Labels: :Search Relevance/Analysis, >bug, Team:Search Relevance
Description
Elasticsearch version (`bin/elasticsearch --version`): 7.3.1
Plugins installed: []
JVM version (`java -version`): Embedded Java 11
OS version (`uname -a` if on a Unix-like system):
Description: When the `_analyze` API endpoint is called on an index that has a normalizer defined in its index settings, the output comes from the default analyzer instead of the normalizer under test.
Steps to reproduce:
Create a test index:
```
PUT word_delimiter_test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "filter_noisy_characters": {
          "pattern": "[.-:\"]",
          "type": "pattern_replace",
          "replacement": " "
        },
        "convert_dots": {
          "flags": "CASE_INSENSITIVE",
          "pattern": "\\.(net|js|io)",
          "type": "pattern_replace",
          "replacement": "dot$1"
        }
      },
      "filter": {
        "word_delimiter": {
          "split_on_numerics": true,
          "generate_word_parts": true,
          "generate_number_parts": true,
          "catenate_all": true,
          "type": "word_delimiter_graph",
          "type_table": [
            "# => ALPHA",
            "+ => ALPHA"
          ]
        },
        "synonym": {
          "type": "synonym_graph",
          "synonyms": [
            "casp, comptia advanced security practitioner"
          ]
        }
      },
      "analyzer": {
        "test": {
          "char_filter": [
            "convert_dots"
          ],
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "synonym",
            "word_delimiter",
            "flatten_graph"
          ]
        }
      },
      "normalizer": {
        "languages_normalizer": {
          "filter": [
            "trim"
          ],
          "type": "custom",
          "char_filter": [
            "convert_dots",
            "filter_noisy_characters"
          ]
        }
      }
    }
  }
}
```
Then call the `_analyze` endpoint to test the normalizer:
```
GET word_delimiter_test/_analyze
{
  "text": "Wi-fi",
  "normalizer": "languages_normalizer"
}
```
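For comparison, a normalizer has no tokenizer, so it should emit the whole input as exactly one token. The following is a minimal sketch of the `languages_normalizer` chain in plain Python, not Elasticsearch itself. The function name `simulate_normalizer` is hypothetical, and the sketch assumes that Python's `re` module and `str.strip()` behave like Java's regex engine and the `trim` filter for these inputs.

```python
import re

# Char filters from the index settings, in the order the normalizer lists them.
# Assumption: Python's `re` reads these two patterns the same way Java's regex does.
CHAR_FILTERS = [
    # convert_dots: pattern "\\.(net|js|io)", CASE_INSENSITIVE, replacement "dot$1"
    (re.compile(r"\.(net|js|io)", re.IGNORECASE), r"dot\1"),
    # filter_noisy_characters: pattern "[.-:\"]", replacement " "
    (re.compile(r'[.-:"]'), " "),
]


def simulate_normalizer(text):
    """Apply the char filters, then trim; return a one-element token list."""
    for pattern, replacement in CHAR_FILTERS:
        text = pattern.sub(replacement, text)
    # Stand-in for the `trim` token filter (assumed equivalent to str.strip()).
    # There is no tokenizer step, so the text is never split.
    return [text.strip()]


if __name__ == "__main__":
    print(simulate_normalizer("Wi-fi"))
    print(simulate_normalizer("Node.js"))
```

Under these assumptions the result is always a single token, never a split on the hyphen. Note that inside the character class, `.-:` is read as a range from `.` to `:`, which does not include `-`, so this sketch leaves the hyphen in "Wi-fi" untouched.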
I expected a single token, "wifi", as the output.
Instead, the text is tokenized as if by the default analyzer:
```json
{
  "tokens" : [
    {
      "token" : "wi",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "fi",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
```
Activity
elasticmachine commented on Oct 29, 2019
Pinging @elastic/es-search (:Search/Analysis)
jtibshirani commented on Oct 29, 2019
Thanks @dougnelas for raising this, it indeed looks like we have a bug there.
Fix `_analyze` API to correctly use normalizers when specified (#48866)