Why are some words searched incorrectly in Elasticsearch?

elasticsearch

un1t, 2016-04-13 13:00:15

I am using the russian_morphology plugin.
The analyzer turns the surname "петрова" (petrova) into the token "петров", and the surname "петров" (petrov) into "петров" and "петр". Everything is fine here, as it should be.
But there is another surname, "аккуратова" (akkuratova): it turns into "аккуратов", while the same surname in the masculine nominative, "аккуратов", turns into "аккурат". And here is the problem: the two forms share no token, so a query for "аккуратов" will not find, for example, the phrase "портрет Аккуратова" ("portrait of Akkuratov").
Below are the settings and examples of requests to the analyzer.

"settings" : {
      "index" : {
        "analysis" : {
          "filter" : {
            "my_stopwords" : {
              "type" : "stop",
              "stopwords" : "а,без,более,бы,был,была,были,было,быть,в,вам,вас,весь,во,вот,все,всего,всех,вы,где,да,даже,для,до,его,ее,если,есть,еще,же,за,здесь,и,из,или,им,их,к,как,ко,когда,кто,ли,либо,мне,может,мы,на,надо,наш,не,него,нее,нет,ни,них,но,ну,о,об,однако,он,она,они,оно,от,очень,по,под,при,с,со,так,также,такой,там,те,тем,то,того,тоже,той,только,том,ты,у,уже,хотя,чего,чей,чем,что,чтобы,чье,чья,эта,эти,это,я"
            }
          },
          "char_filter" : {
            "my_charfilter" : {
              "type" : "mapping",
              "mappings" : [ "Ё=>Е", "ё=>е" ]
            }
          },
          "analyzer" : {
            "my_analyzer" : {
              "filter" : [ "lowercase", "russian_morphology", "my_stopwords" ],
              "char_filter" : [ "my_charfilter" ],
              "type" : "custom",
              "tokenizer" : "standard"
            }
          }
        }
      }
}

$ curl -XGET 'localhost:9200/myindex/_analyze?pretty&tokenizer=standard&token_filters=russian_morphology' -d 'петрова'

{
  "tokens" : [ {
    "token" : "петров",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}

$ curl -XGET 'localhost:9200/myindex/_analyze?pretty&tokenizer=standard&token_filters=russian_morphology' -d 'петров'

{
  "tokens" : [ {
    "token" : "петров",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "петр",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}

$ curl -XGET 'localhost:9200/myindex/_analyze?pretty&tokenizer=standard&token_filters=russian_morphology' -d 'аккуратов'

{
  "tokens" : [ {
    "token" : "аккурат",
    "start_offset" : 0,
    "end_offset" : 9,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}

$ curl -XGET 'localhost:9200/myindex/_analyze?pretty&tokenizer=standard&token_filters=russian_morphology' -d 'аккуратова'

{
  "tokens" : [ {
    "token" : "аккуратов",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}
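
To make the mismatch concrete, here is a small Python sketch (not part of the original setup). It only replays the tokens from the four analyzer responses above. Which input produced which response is inferred from the end_offset values (7, 6, 9 and 10 characters), and the `can_match` helper is a hypothetical, simplified model: it assumes the same analyzer runs at index time and at search time, and that a hit requires at least one shared token.

```python
# Tokens copied from the _analyze responses in this post; nothing is computed
# by Elasticsearch here. The input-to-response pairing is an assumption
# inferred from the end_offset values.
ANALYZED = {
    "петрова": {"петров"},
    "петров": {"петров", "петр"},
    "аккуратов": {"аккурат"},
    "аккуратова": {"аккуратов"},
}


def can_match(query_word: str, indexed_word: str) -> bool:
    """Simplified model: the query hits only if both words share a token."""
    return bool(ANALYZED[query_word] & ANALYZED[indexed_word])


if __name__ == "__main__":
    pairs = [("петров", "петрова"), ("аккуратов", "аккуратова")]
    for query_word, indexed_word in pairs:
        verdict = "match" if can_match(query_word, indexed_word) else "no match"
        print(f"query {query_word!r} vs indexed {indexed_word!r}: {verdict}")
```

In this model the first pair matches because both forms produce "петров", while the second pair has no token in common, which is exactly the case that fails in search.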

1 answer(s)

Zakharov Alexander, 2016-04-14
@AlexZaharow

Not exactly an answer, but there is a forum where problems with ES are discussed, including Russian morphology in particular (in case it is relevant): https://discuss.elastic.co/c/in-your-native-tongue...
It is also worth looking through this plugin's issue tracker (mostly the closed issues, since the open ones have no answers):
https://github.com/imotov/elasticsearch-analysis-m...
The plugin's author is Igor Motov; I have asked him questions before.
I had a similar problem when using wildcard queries, but it turned out not to be a morphology issue.