
_analyze api does not correctly use normalizers when specified #48650

Closed

dougnelas opened this issue Oct 29, 2019 · 2 comments
Labels
>bug :Search/Analysis How text is split into tokens

@dougnelas commented Oct 29, 2019

Elasticsearch version (bin/elasticsearch --version): 7.3.1

Plugins installed: []

JVM version (java -version): Embedded Java 11

OS version (uname -a if on a Unix-like system):

Description: When using the _analyze API endpoint on an index that has a normalizer defined in its settings, the output comes from the default analyzer instead of the normalizer under test.

Steps to reproduce:

Create a test index:

```
PUT word_delimiter_test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "filter_noisy_characters": {
          "pattern": "[.-:\"]",
          "type": "pattern_replace",
          "replacement": " "
        },
        "convert_dots": {
          "flags": "CASE_INSENSITIVE",
          "pattern": "\\.(net|js|io)",
          "type": "pattern_replace",
          "replacement": "dot$1"
        }
      },
      "filter": {
        "word_delimiter": {
          "split_on_numerics": true,
          "generate_word_parts": true,
          "generate_number_parts": true,
          "catenate_all": true,
          "type": "word_delimiter_graph",
          "type_table": [
            "# => ALPHA",
            "+ => ALPHA"
          ]
        },
        "synonym": {
          "type": "synonym_graph",
          "synonyms": [
            "casp, comptia advanced security practitioner"
          ]
        }
      },
      "analyzer": {
        "test": {
          "char_filter": [
            "convert_dots"
          ],
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "synonym",
            "word_delimiter",
            "flatten_graph"
          ]
        }
      },
      "normalizer": {
        "languages_normalizer": {
          "filter": [
            "trim"
          ],
          "type": "custom",
          "char_filter": [
            "convert_dots",
            "filter_noisy_characters"
          ]
        }
      }
    }
  }
}
```

Then call the _analyze endpoint to test the normalizer:

```
GET word_delimiter_test/_analyze
{
  "text": "Wi-fi",
  "normalizer": "languages_normalizer"
}
```
The expected output should be "wifi".
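
For contrast, here is a sketch of the single-token response the author expected. The offsets and token type are illustrative, not taken from an actual run:

```json
{
  "tokens" : [
    {
      "token" : "wifi",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    }
  ]
}
```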

But the actual output shows the text was run through the default analyzer instead:

```json
{
  "tokens" : [
    {
      "token" : "wi",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "fi",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
```
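
As a cross-check (not part of the original report), one way to see what languages_normalizer actually produces is to map a keyword field that uses it, index the sample text, and run a terms aggregation, since terms aggregations return the normalized doc values. The field name `lang` and the document ID are assumptions chosen for illustration:

```
PUT word_delimiter_test/_mapping
{
  "properties": {
    "lang": {
      "type": "keyword",
      "normalizer": "languages_normalizer"
    }
  }
}

PUT word_delimiter_test/_doc/1?refresh=true
{
  "lang": "Wi-fi"
}

GET word_delimiter_test/_search
{
  "size": 0,
  "aggs": {
    "normalized_values": {
      "terms": { "field": "lang" }
    }
  }
}
```

The aggregation keys show the values after the normalizer's char filters and token filters have been applied, which makes it easy to compare against whatever _analyze reports for the same text.
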
@jtibshirani added the :Search/Analysis How text is split into tokens and >bug labels on Oct 29, 2019
@elasticmachine (Collaborator) commented

Pinging @elastic/es-search (:Search/Analysis)

@jtibshirani (Contributor) commented Oct 29, 2019

Thanks @dougnelas for raising this; it does indeed look like we have a bug here.

cbuescher pushed a commit that referenced this issue Nov 14, 2019
Currently the `_analyze` endpoint doesn't correctly use normalizers specified in the request. This change fixes that by returning the resolved normalizer from TransportAnalyzeAction#getAnalyzer and updates the tests so they can catch this in the future.

Closes #48650