How do I get the desired relevance of results when searching a multi-value field?
There is an entity that can have an arbitrary number of names.
These names are searched with Elasticsearch.
The names are stored in a single field as an array. The field uses a complex analyzer with a complex tokenizer.
The problem is that Elasticsearch treats a multi-value field (array) as a single string, so the relevance score is computed over the entire array rather than for the one array element that actually matched.
Below is a highly simplified example.
Create an index
curl -XDELETE 'http://localhost:9200/tests'
curl -XPUT 'http://localhost:9200/tests' -d'{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "edge_ngram_tokenizer",
"filter": ["lowercase", "asciifolding"]
}
},
"tokenizer": {
"edge_ngram_tokenizer": {
"type": "edgeNGram",
"min_gram": "3",
"max_gram": "12",
"token_chars": ["letter", "digit"]
}
}
}
},
"mappings": {
"test": {
"properties": {
"name": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
}
}'
Add two documents
curl -XPOST 'http://localhost:9200/tests/test' -d'{ "id": 1, "name": ["text"] }'
curl -XPOST 'http://localhost:9200/tests/test' -d'{ "id": 2, "name": ["text", "text"] }'
Search
curl -XGET 'http://localhost:9200/tests/test/_search' -d'{
"query": {
"match": {
"name": "text"
}
}
}'
Result
{
"took": 0,
"timed_out": false,
"_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
"hits": {
"total": 2,
"max_score": 0.7911257,
"hits": [{
"_index": "tests",
"_type": "test",
"_id": "AWOtIL2gdpqdbX7hdDXg",
"_score": 0.7911257,
"_source": { "id": 2, "name": [ "text", "text" ] }
}, {
"_index": "tests",
"_type": "test",
"_id": "AWOtIL0ldpqdbX7hdDXf",
"_score": 0.51623213,
"_source": { "id": 1, "name": [ "text" ] }
}]
}
}
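The ranking above is consistent with term frequency being counted across all array values at once. Here is a minimal Python sketch of that counting, not Elasticsearch's actual implementation. It approximates the analyzer from the question (edge n-grams of 3 to 12 characters, lowercased, ASCII letters and digits only, asciifolding omitted), and the function names are made up for illustration.

```python
from collections import Counter
import re


def edge_ngrams(value, min_gram=3, max_gram=12):
    """Rough stand-in for my_analyzer: split on anything that is not an
    ASCII letter or digit, emit prefixes of each word, lowercase them."""
    grams = []
    for word in re.findall(r"[A-Za-z0-9]+", value):
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            grams.append(word[:size].lower())
    return grams


def field_term_counts(values):
    """Count terms over every element of a multi-value field,
    treating all elements as one token stream."""
    counts = Counter()
    for value in values:
        counts.update(edge_ngrams(value))
    return counts


doc1 = field_term_counts(["text"])          # document with id 1
doc2 = field_term_counts(["text", "text"])  # document with id 2

print(dict(doc1))  # {'tex': 1, 'text': 1}
print(dict(doc2))  # {'tex': 2, 'text': 2}
```

Both documents contain the same terms, but the second one has every term twice, and a frequency-based similarity rewards that even though no single name matches any better.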
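A dis_max query over separate per-name fields
curl -XGET 'http://localhost:9200/tests/test/_search' -d'{
"query": {
"dis_max": {
"queries": [
{ "match": { "name_0": "text" } },
{ "match": { "name_1": "text" } },
{ "match": { "name_2": "text" } }
],
"tie_breaker": 0,
"boost": 1
}
}
}'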
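The dis_max query above seems to assume that each name lives in its own field (name_0, name_1, name_2), which the mapping in the question does not define. As a sketch of why that helps: dis_max takes the best clause score and adds tie_breaker times the sum of the other matching clauses. The helper below is only an illustration of that arithmetic, and the clause scores are invented numbers.

```python
def dis_max_score(clause_scores, tie_breaker=0.0, boost=1.0):
    """Combine per-clause scores the way a dis_max query does.

    clause_scores: one entry per clause, None where the clause did not match.
    """
    matching = [s for s in clause_scores if s is not None]
    if not matching:
        return None  # the document does not match at all
    best = max(matching)
    others = sum(matching) - best
    return boost * (best + tie_breaker * others)


# Hypothetical per-field scores for the two documents from the question.
one_name = dis_max_score([0.5, None, None])
two_names = dis_max_score([0.5, 0.5, None])

print(one_name, two_names)  # 0.5 0.5
```

With tie_breaker set to 0, a second matching name adds nothing to the score, so both documents rank equally. The cost is a fixed upper bound on the number of names per entity.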
One document per name
curl -XPOST 'http://localhost:9200/tests/test' -d'{ "product_id": 1, "name": "text" }'
curl -XPOST 'http://localhost:9200/tests/test' -d'{ "product_id": 2, "name": "text" }'
curl -XPOST 'http://localhost:9200/tests/test' -d'{ "product_id": 2, "name": "text" }'
I'll answer my own question.
If you want to store the array of names in one field, you can add "index_options": "docs" to the "name" field, and the number of occurrences of a word will no longer affect relevance.
You can also add "norms": { "enabled": false }, and the total length of the field will no longer affect relevance.
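For reference, here is roughly how the "name" field mapping would look with both options applied. It reuses the older mapping syntax from the question ("type": "string", "norms": { "enabled": false }). Newer Elasticsearch versions spell some of these options differently, so treat it as a sketch and check the mapping reference for your version.

```python
import json

# The "name" field from the question plus the two options described above.
name_field = {
    "type": "string",
    "analyzer": "my_analyzer",
    "index_options": "docs",      # index only which documents contain a term
    "norms": {"enabled": False},  # drop field-length normalization
}

mappings = {"test": {"properties": {"name": name_field}}}

# Print the fragment that would replace the "mappings" section of the index body.
print(json.dumps({"mappings": mappings}, indent=2))
```

Keep in mind that "index_options": "docs" also drops term positions, so phrase queries against this field stop working.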
But all of these are workarounds that will not help in more complex cases, where you want different combinations of relevance boosts and penalties for different situations.
For example, with an array of names you cannot boost relevance when one of the names matches the search phrase exactly.
So the only suitable option I see is to build the index so that each product name maps to one document in the Elasticsearch index.
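A rough sketch of what the one-document-per-name layout implies on the application side: split each product into per-name documents before indexing, then keep only the best-scoring hit per product after searching. The function names and the sample hits are hypothetical, and the deduplication could also be done inside Elasticsearch instead of in client code.

```python
def explode_names(product_id, names):
    """Turn one product with many names into one document per name."""
    return [{"product_id": product_id, "name": name} for name in names]


def best_hit_per_product(hits):
    """Keep the highest-scoring hit for each product_id, best first."""
    best = {}
    for hit in hits:
        pid = hit["_source"]["product_id"]
        if pid not in best or hit["_score"] > best[pid]["_score"]:
            best[pid] = hit
    return sorted(best.values(), key=lambda h: h["_score"], reverse=True)


docs = explode_names(2, ["text", "text"])
print(docs)

# Invented search hits: product 2 matched twice, product 1 once.
hits = [
    {"_score": 0.5, "_source": {"product_id": 1, "name": "text"}},
    {"_score": 0.5, "_source": {"product_id": 2, "name": "text"}},
    {"_score": 0.4, "_source": {"product_id": 2, "name": "text"}},
]
for hit in best_hit_per_product(hits):
    print(hit["_source"]["product_id"], hit["_score"])
```

Each name is now scored on its own, so a product with many similar names no longer outranks a product with one good match.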
The article Theory Behind Relevance Scoring helped me a lot in understanding how relevance is computed.
I also discovered that relevance is affected by how often a word occurs in the entire index: https://www.elastic.co/guide/en/elasticsearch/guid... Sometimes the effect is very strong. So when implementing search by names, the idf ranking factor should definitely be disabled: https://stackoverflow.com/questions/33208587/elast... (works in Elasticsearch starting from version 6.2).
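To get a feel for how strong the idf effect can be, here is a small sketch of the idf term as Lucene's BM25 similarity computes it. It assumes BM25 is the similarity in use, and the document counts are invented.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """idf term of BM25: doc_count documents have the field,
    doc_freq of them contain the term."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical index of 1000 names.
rare = bm25_idf(1000, 1)      # a word that occurs in a single name
common = bm25_idf(1000, 500)  # a word that occurs in half of the names

print(round(rare, 3), round(common, 3))
print(rare / common)  # how much more the rare word contributes
```

For name search this means a match on an unusual word can outweigh a better match on a common one, which is why removing idf from the score makes the ranking more predictable.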