Sphinx
IvaYan, 2014-10-05 22:27:37

How do I configure a Sphinx RT index to support Cyrillic?

Hello!
I'm trying to run Sphinx 2.2.4 and ran into a problem with Cyrillic.
The configuration is as follows:
index test_idx {
    type = rt # realtime index
    path = ./idx/testrt
    rt_field = text # field to index
    morphology = stem_enru
    min_word_len = 1
    rt_mem_limit = 256M
}
searchd {
    listen = 3307:mysql41
    log = ./logs/searchd.log
    query_log = ./logs/query.log
    read_timeout = 5
    max_children = 30
    pid_file = ./logs/searchd.pid
    seamless_rotate = 1
    preopen_indexes = 1
    unlink_old = 1
    workers = threads # for RT to work
    binlog_path = ./idx
}
I talk to Sphinx over SphinxQL using the mysql console client. When I add a Latin phrase to
the index, everything works and the search finds it. But a phrase added in Cyrillic is never
found. Obviously, the problem is with the encoding. The question is: how do I make Sphinx
understand Cyrillic?
I've seen recommendations to set charset_type = utf8, but doing so makes searchd print the
message "WARNING: key 'charset_type' was permanently removed from Sphinx configuration. Refer
to documentation for details." In other words, Sphinx says the key no longer exists. I get a
similar result when using charset_table.
To repeat: there is no database being indexed; data is added to the RT index through
SphinxQL.


1 answer(s)
klirichek, 2014-11-03
@klirichek

charset_type only determined which kind of encoding (single-byte or Unicode) was fed to Sphinx. The single-byte mode is obviously a relic of the past (you had to mess around with code pages and so on), so it was removed. And since there were only two modes, the option itself was removed too.
charset_table, however, was not touched and still works. It cannot produce a "similar result", so check carefully!
That is actually the key point.
charset_table determines how the incoming character stream is converted before Sphinx processes it. Your console client can also play a part here, because it too sends data to Sphinx in some particular encoding.
There is also a "magic command", SHOW META. Run it immediately after a query and it will show exactly which words Sphinx searched for and how.
As far as I remember, a typical problem used to be that charset_type was set to sbcs while the data being indexed was UTF-8. That should no longer happen, but something else may have come up.
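
To illustrate the charset_table point, here is a minimal sketch of the index section from the question with an explicit charset_table covering digits, Latin letters, and the Russian Cyrillic range. This is an assumption about what the setup may need, not a confirmed fix: to my knowledge the default table in Sphinx 2.2.x already includes these ranges, so the line mainly makes the mapping explicit and easy to extend.

```
index test_idx {
    type = rt
    path = ./idx/testrt
    rt_field = text
    morphology = stem_enru
    min_word_len = 1
    rt_mem_limit = 256M
    # Digits, underscore, Latin, and Russian Cyrillic (including Ё/ё);
    # "X..Y->x..y" folds the upper-case range onto the lower-case one.
    charset_table = 0..9, A..Z->a..z, _, a..z, \
        U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451
}
```

If Cyrillic queries still return nothing with this table, the client encoding is the next suspect: the mysql console client may need to be started with --default-character-set=utf8 so that the text actually reaches searchd as UTF-8. Documents inserted before the change were tokenized under the old settings, so they would likely have to be re-inserted.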
