Why does Elasticsearch rank documents this way?

elasticsearch

un1t, 2017-01-11 11:52:01

The index contains the following documents:

{ "name" : "Роза"}
{ "name" : "Липестки роз"}
{ "name" : "Роза 150 см"}
{ "name" : "Роза 130 см"}
{ "name" : "Роза 50 см"}
{ "name" : "Роза 30 см"}

I search for the word "роза" (rose):

curl 'http://localhost:9200/test/_search?pretty' -d '{"query": {"match": {"name": "роза"}}}'

We get the following results:

{
  "hits" : {
    "total" : 6,
    "max_score" : 0.4451987,
    "hits" : [ {
      "_id" : "2",
      "_score" : 0.4451987,
      "_source" : {
        "name" : "Липестки роз"
      }
    }, {
      "_id" : "4",
      "_score" : 0.35615897,
      "_source" : {
        "name" : "Роза 130 см"
      }
    }, {
      "_id" : "6",
      "_score" : 0.35615897,
      "_source" : {
        "name" : "Роза 30 см"
      }
    }, {
      "_id" : "1",
      "_score" : 0.30685282,
      "_source" : {
        "name" : "Роза"
      }
    }, {
      "_id" : "5",
      "_score" : 0.15342641,
      "_source" : {
        "name" : "Роза 50 см"
      }
    }, {
      "_id" : "3",
      "_score" : 0.15342641,
      "_source" : {
        "name" : "Роза 150 см"
      }
    } ]
  }
}
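
To see how each `_score` above was computed, the same search can be repeated with `"explain": true` in the request body, which makes Elasticsearch attach a score breakdown to every hit. The sketch below is only an illustration: the helper names `build_explain_search` and `run_search` are made up for this example, and the optional `search_type=dfs_query_then_fetch` parameter is included on the assumption that the index has more than one shard, so that term statistics are collected across all shards before scoring.

```python
import json
import urllib.parse
import urllib.request


def build_explain_search(base_url, index, field, text, dfs=True):
    """Return (url, body) for a match query that asks for a score explanation.

    Hypothetical helper for this example; it only builds the request.
    """
    params = {"pretty": "true"}
    if dfs:
        # Gather term statistics from every shard before scoring.
        params["search_type"] = "dfs_query_then_fetch"
    url = "{}/{}/_search?{}".format(
        base_url.rstrip("/"), index, urllib.parse.urlencode(params)
    )
    body = {"explain": True, "query": {"match": {field: text}}}
    return url, json.dumps(body, ensure_ascii=False).encode("utf-8")


def run_search(url, body):
    """Send the request (POST); requires a running Elasticsearch node."""
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    url, body = build_explain_search(
        "http://localhost:9200", "test", "name", "роза"
    )
    print(url)
    print(body.decode("utf-8"))
    # With a node listening on localhost:9200, uncomment to run the search:
    # print(json.dumps(run_search(url, body), ensure_ascii=False, indent=2))
```

Each hit in the response then carries an `_explanation` object listing the factors that went into its score, which makes it possible to compare two hits term by term.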

First, why do "Роза 130 см" (rose 130 cm) and "Роза 150 см" (rose 150 cm) get such different scores (0.35615897 vs. 0.15342641)? They are nowhere near each other in the results.
Second, why is "Липестки роз" (rose petals) in first place? In my opinion, the document "Роза" is clearly more relevant to the query. Search engines usually take into account how close the matched word is to the beginning of the document, but that does not seem to happen here.
I am using the Russian morphology plugin; the settings look like this:

{
    "analysis": {
        "char_filter": {
            "my_charfilter": {
                "type": "mapping",
                "mappings": ["Ё=>Е", "ё=>е"]
            }
        },
        "analyzer": {
            "default_index": {
                "type": "custom",
                "char_filter": ["my_charfilter"],
                "tokenizer": "standard",
                "filter": ["lowercase", "russian_morphology", "my_stopwords"]
            },
            "default_search": {
                "type": "custom",
                "char_filter": ["my_charfilter"],
                "tokenizer": "standard",
                "filter": ["lowercase", "russian_morphology", "my_stopwords"]
            },
            "lower_keyword": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": "lowercase"
            }
        },
        "filter": {
            "my_stopwords": {
                "type": "stop",
                "stopwords": "а,без,более,бы,был,была,были,было,быть,в,вам,вас,весь,во,вот,все,всего,всех,вы,где,да,даже,для,до,его,ее,если,есть,еще,же,за,здесь,и,из,или,им,их,к,как,ко,когда,кто,ли,либо,мне,может,мы,на,надо,наш,не,него,нее,нет,ни,них,но,ну,о,об,однако,он,она,они,оно,от,очень,по,под,при,с,со,так,также,такой,там,те,тем,то,того,тоже,той,только,том,ты,у,уже,хотя,чего,чей,чем,что,чтобы,чье,чья,эта,эти,это,я"
            }
        }
    }
}

If anyone wants to reproduce this, here is the Gist on GitHub.
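
One possible line of investigation, offered as a sketch rather than a confirmed diagnosis: the scores in the results are numerically consistent with Lucene's classic TF/IDF similarity being computed separately on each shard. For a single-term query the score reduces to `idf * tf * fieldNorm`, where `idf = 1 + ln(docs_in_shard / (doc_freq + 1))`, `tf = sqrt(term_freq)`, and `fieldNorm = 1 / sqrt(number_of_terms)` stored with reduced precision. The shard layout used below (documents 2, 4 and 6 on one shard; documents 1, 3 and 5 each alone on a shard) is an assumption inferred from the numbers, not something stated in the question. The function names are illustrative, and `quantize_norm` only approximates the one-byte norm encoding for values in this range.

```python
import math
import struct


def quantize_norm(value):
    """Approximate the lossy one-byte norm storage by keeping only the
    sign, exponent and top two mantissa bits of the float32 value."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFE00000))[0]


def classic_score(shard_docs, doc_freq, field_terms, term_freq=1):
    """Classic TF/IDF score of a single-term query against one document,
    using statistics from that document's shard only."""
    idf = 1.0 + math.log(shard_docs / (doc_freq + 1.0))
    tf = math.sqrt(term_freq)
    field_norm = quantize_norm(1.0 / math.sqrt(field_terms))
    return idf * tf * field_norm


if __name__ == "__main__":
    # (label, docs in shard, docs containing the term, terms in the field)
    # The shard statistics are ASSUMED, chosen to match the reported scores.
    cases = [
        ("Липестки роз (id 2)", 3, 3, 2),
        ("Роза 130 см (id 4)", 3, 3, 3),
        ("Роза 30 см (id 6)", 3, 3, 3),
        ("Роза (id 1)", 1, 1, 1),
        ("Роза 50 см (id 5)", 1, 1, 3),
        ("Роза 150 см (id 3)", 1, 1, 3),
    ]
    for label, shard_docs, doc_freq, field_terms in cases:
        print(label, round(classic_score(shard_docs, doc_freq, field_terms), 8))
```

If this is what is happening, "Роза 130 см" and "Роза 150 см" differ only because their shards hold different numbers of documents, and "Липестки роз" wins because it has a two-term field on the best-populated shard. Searching with `search_type=dfs_query_then_fetch`, or recreating the test index with a single shard, should then give documents of equal length equal scores.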
