elasticsearch
Andrey Sidorov, 2018-05-29 21:45:47

How do I get the desired relevance when searching a multi-value field?

There is an entity that can have an arbitrary number of names.
These names are searched in Elasticsearch.
The names are stored in a single field as an array. The field has a complex analyzer with a complex tokenizer.
The problem is that Elasticsearch treats a multi-value field (array) as one string: the relevance of a hit is computed over the entire array, not for the one array element that matched.
Below is a heavily simplified example.
Create the index:

curl -XDELETE 'http://localhost:9200/tests'
curl -XPUT 'http://localhost:9200/tests' -d'{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "edge_ngram_tokenizer",
          "filter": ["lowercase", "asciifolding"]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "3",
          "max_gram": "12",
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}'

Index the documents:
curl -XPOST 'http://localhost:9200/tests/test' -d'{ "id": 1, "name": ["text"] }'
curl -XPOST 'http://localhost:9200/tests/test' -d'{ "id": 2, "name": ["text", "text"] }'

Search:
curl -XGET 'http://localhost:9200/tests/test/_search' -d'{
  "query": {
    "match": {
      "name": "text"
    }
  }
}'

Results:
{
  "took": 0,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 2,
    "max_score": 0.7911257,
    "hits": [{
      "_index": "tests",
      "_type": "test",
      "_id": "AWOtIL2gdpqdbX7hdDXg",
      "_score": 0.7911257,
      "_source": { "id": 2, "name": [ "text", "text" ] }
    }, {
      "_index": "tests",
      "_type": "test",
      "_id": "AWOtIL0ldpqdbX7hdDXf",
      "_score": 0.51623213,
      "_source": { "id": 1, "name": [ "text" ] }
    }]
  }
}

As a result, id: 2 gets relevance 0.7911257 and id: 1 gets relevance 0.51623213.
I need the query to return both documents, and they must have the same relevance.
What can I do?
This matters because in the production code I split the name into several subfields that are analyzed by different analyzers, and then run a complex dis_max query with a different weight for each subfield (a boost for a full match, a boost for a match at the beginning of the string, a search by partial match, and so on).
I know two solutions to the problem, but neither suits me. Are there other options?
1. When there are few names, they can be stored in separate fields name_0, name_1, name_2, etc.
A dis_max query with tie_breaker: 0 over those fields then gives the expected relevance:
"query": {
  "dis_max": {
    "queries": [
      { "match": { "name_0": "text" } },
      { "match": { "name_1": "text" } },
      { "match": { "name_2": "text" } }
    ],
    "tie_breaker": 0,
    "boost": 1
  }
}

2. Store one document per name in Elasticsearch:
curl -XPOST 'http://localhost:9200/tests/test' -d'{ "product_id": 1, "name": "text" }'
curl -XPOST 'http://localhost:9200/tests/test' -d'{ "product_id": 2, "name": "text" }'
curl -XPOST 'http://localhost:9200/tests/test' -d'{ "product_id": 2, "name": "text" }'

In this case, the results have to be additionally grouped by product_id, which causes problems with pagination and with any further aggregation of the results.
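For illustration only, the grouping by product_id could be sketched at query time with field collapsing. This assumes Elasticsearch 5.3 or later (where the `collapse` search option exists) and a numeric or keyword `product_id` field; the `ES_URL` and `QUERY` variables are hypothetical names used only in this sketch.

```shell
# Assumption: a local node, as in the examples above; override ES_URL if needed.
ES_URL="${ES_URL:-http://localhost:9200}"

# Return only the best-scoring document per product_id.
QUERY='{
  "query": { "match": { "name": "text" } },
  "collapse": { "field": "product_id" },
  "from": 0,
  "size": 10
}'

curl -s -XGET "$ES_URL/tests/test/_search?pretty" \
  -H 'Content-Type: application/json' -d "$QUERY" \
  || echo "request failed: is Elasticsearch reachable at $ES_URL?"
```

Even with collapsing, hits.total still counts matching documents rather than distinct products, and aggregations still run over the uncollapsed documents, so the pagination and aggregation concerns above do not fully go away.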


1 answer(s)
Andrey Sidorov, 2018-06-04
@morr

I'll answer my own question.
If you want to store an array of names in one field, you can add "index_options": "docs" to the "name" field, and the number of occurrences of a word will no longer affect relevance.
You can also add "norms": { "enabled": false }, and the total length of the field will no longer affect relevance.
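A minimal sketch of such a mapping, reusing the analyzer from the question. It assumes Elasticsearch 5.x or 6.x: the field type is written as `text` and norms use the boolean form `"norms": false`, which is the newer spelling of the object form quoted above. The `ES_URL` and `BODY` variables are hypothetical names for this sketch.

```shell
# Assumption: a local node, as in the question; override ES_URL if needed.
ES_URL="${ES_URL:-http://localhost:9200}"

# Same analysis settings as in the question; only the "name" field options differ.
BODY='{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "edge_ngram_tokenizer",
          "filter": ["lowercase", "asciifolding"]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "3",
          "max_gram": "12",
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_analyzer",
          "index_options": "docs",
          "norms": false
        }
      }
    }
  }
}'

# Recreate the test index, as the question does (this deletes existing data).
curl -s -XDELETE "$ES_URL/tests" > /dev/null || true
curl -s -XPUT "$ES_URL/tests" -H 'Content-Type: application/json' -d "$BODY" \
  || echo "request failed: is Elasticsearch reachable at $ES_URL?"
```

One side effect to keep in mind: with "index_options": "docs" no positions are indexed, so phrase queries against this field will not work.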
But these are workarounds that will not help in more complex cases, where you want various combinations of raising and lowering relevance for different situations.
For example, with an array of names you cannot boost relevance when one of the names fully matches the search phrase.
Therefore, the only suitable option I see is to build the index so that one product name maps to one document in the Elasticsearch index.
The article Theory Behind Relevance Scoring helped a lot in understanding how relevance is computed.
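To see how a particular score is put together on your own data, the search request from the question can be run with "explain": true. This is a sketch against the test index from the question; `ES_URL` and `QUERY` are hypothetical variable names.

```shell
# Assumption: a local node, as in the question; override ES_URL if needed.
ES_URL="${ES_URL:-http://localhost:9200}"

# Same match query as in the question, with a per-hit score explanation.
QUERY='{
  "explain": true,
  "query": { "match": { "name": "text" } }
}'

curl -s -XGET "$ES_URL/tests/test/_search?pretty" \
  -H 'Content-Type: application/json' -d "$QUERY" \
  || echo "request failed: is Elasticsearch reachable at $ES_URL?"
```

Each hit then carries an _explanation tree that breaks the score down into its parts, such as the term frequency and idf contributions.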
I also discovered that relevance is affected by how often a word occurs in the entire index: https://www.elastic.co/guide/en/elasticsearch/guid... Sometimes the effect is very strong. Therefore, when implementing a search by names, the idf ranking factor should definitely be disabled: https://stackoverflow.com/questions/33208587/elast... (works in Elasticsearch starting from version 6.2).
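To tie the three factors together, here is the classic BM25 formula, written out under the assumption that the index uses the default BM25 similarity (the default since Elasticsearch 5.0) with the default parameters k_1 = 1.2 and b = 0.75. Here f(t, d) is the number of times term t occurs in the field, |d| is the field length in tokens, avgdl is the average field length, N is the number of documents and n_t is the number of documents containing t.

```latex
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{idf}(t)\,
  \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{idf}(t) = \ln\!\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)
```

For a field of average length (|d| = avgdl) the fraction reduces to f (k_1 + 1) / (f + k_1), which is 2.2 / 2.2 = 1 for f = 1 and 4.4 / 3.2 = 1.375 for f = 2, so a name that is repeated in the array scores higher. With "index_options": "docs" the frequency is effectively fixed at 1, disabling norms removes the |d| / avgdl term, and idf(t) is the remaining factor that depends on the rest of the index.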
