_analyze API does not correctly use normalizers when specified #48650

@dougnelas

Description

Elasticsearch version (bin/elasticsearch --version): 7.3.1

Plugins installed: []

JVM version (java -version): Embedded Java 11

OS version (uname -a if on a Unix-like system):

Description: When using the _analyze API endpoint on an index with a normalizer defined in the index settings, the output comes from the default analyzer instead of the normalizer under test.

Steps to reproduce:

Create a test index:

```
PUT word_delimiter_test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "filter_noisy_characters": {
          "pattern": "[.-:\"]",
          "type": "pattern_replace",
          "replacement": " "
        },
        "convert_dots": {
          "flags": "CASE_INSENSITIVE",
          "pattern": "\\.(net|js|io)",
          "type": "pattern_replace",
          "replacement": "dot$1"
        }
      },
      "filter": {
        "word_delimiter": {
          "split_on_numerics": true,
          "generate_word_parts": true,
          "generate_number_parts": true,
          "catenate_all": true,
          "type": "word_delimiter_graph",
          "type_table": [
            "# => ALPHA",
            "+ => ALPHA"
          ]
        },
        "synonym": {
          "type": "synonym_graph",
          "synonyms": [
            "casp, comptia advanced security practitioner"
          ]
        }
      },
      "analyzer": {
        "test": {
          "char_filter": [
            "convert_dots"
          ],
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "synonym",
            "word_delimiter",
            "flatten_graph"
          ]
        }
      },
      "normalizer": {
        "languages_normalizer": {
          "filter": [
            "trim"
          ],
          "type": "custom",
          "char_filter": [
            "convert_dots",
            "filter_noisy_characters"
          ]
        }
      }
    }
  }
}
```
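
As a side check (the "lang" field below is only an illustration, not part of the original settings), the same normalizer can be attached to a keyword field and exercised through the field-based form of _analyze:

```
PUT word_delimiter_test/_mapping
{
  "properties": {
    "lang": {
      "type": "keyword",
      "normalizer": "languages_normalizer"
    }
  }
}

GET word_delimiter_test/_analyze
{
  "field": "lang",
  "text": "Wi-fi"
}
```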

Then run an _analyze request to test the normalizer:

```
GET word_delimiter_test/_analyze
{
  "text": "Wi-fi",
  "normalizer": "languages_normalizer"
}
```
The expected output should be "wifi".
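
Taking that expectation at face value, and given that a normalizer always emits a single token, the response should look roughly like this (offsets approximate):

```json
{
  "tokens" : [
    {
      "token" : "wifi",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    }
  ]
}
```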

But instead, the output comes from the default analyzer:

```json
{
  "tokens" : [
    {
      "token" : "wi",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "fi",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
```
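
A possible workaround until this is fixed (my own suggestion, not part of the report) is to spell out the normalizer's components manually, since _analyze accepts an ad-hoc tokenizer, char_filter, and filter, and a custom normalizer is equivalent to the keyword tokenizer plus its char filters and token filters:

```
GET word_delimiter_test/_analyze
{
  "tokenizer": "keyword",
  "char_filter": ["convert_dots", "filter_noisy_characters"],
  "filter": ["trim"],
  "text": "Wi-fi"
}
```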

Activity

elasticmachine (Collaborator) commented on Oct 29, 2019:

Pinging @elastic/es-search (:Search/Analysis)

jtibshirani (Contributor) commented on Oct 29, 2019:

Thanks @dougnelas for raising this, it indeed looks like we have a bug there.

A commit referencing this issue was added on Nov 14, 2019: 6ce0442
