Python
DamskiyUgodnik, 2019-02-17 13:19:24

How do I properly connect to Sphinx from Python 3 without running into encoding issues?

Hello!
I've run into a problem displaying data from Sphinx: Cyrillic text that comes from Sphinx is garbled when printed to the console. Oddly enough, I couldn't solve it right away, and searching didn't turn up anything either. I suspect there is a setting somewhere that causes the data to arrive in the "wrong" encoding.
So, here is what we have:
1. If we fetch the data (from which the index is built) directly from MySQL and print it to the console, it displays correctly.
2. If we connect to Sphinx from the console, the data also displays correctly.
3. If we connect via Python, fetch the data from Sphinx, and print it to the console, we get mojibake (garbled characters).
The script with the simplest query to Sphinx that prints the garbled output:

# -*- coding: utf-8 -*-
import MySQLdb, MySQLdb.cursors

sphinx_db = MySQLdb.connect(host='127.0.0.1',port=9306,user='',passwd='',db='', charset='utf8', use_unicode = True, init_command='SET NAMES UTF8')
sphinx_cursor = sphinx_db.cursor(cursorclass=MySQLdb.cursors.DictCursor)

mysql_db = MySQLdb.connect(host='127.0.0.1' ,user='*****', passwd='*****', db='*****', charset='utf8')
mysql_cursor = mysql_db.cursor(cursorclass=MySQLdb.cursors.DictCursor)

def main():
    
    # Prints mojibake (garbled characters)
    sql = """SELECT * FROM documents LIMIT 1"""
    sphinx_cursor.execute(sql)
    sphinx_db.commit()
    data = sphinx_cursor.fetchone()

    print(data['text'])

    # Fetched directly from MySQL, prints correctly
    sql = """SELECT * FROM documents LIMIT 1"""
    mysql_cursor.execute(sql)
    mysql_db.commit()
    data = mysql_cursor.fetchone()

    print(data['text'])

if __name__ == '__main__':
    main()
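
One quick way to narrow this down is to check whether the garbled string is simply UTF-8 that was decoded with a single-byte codec. That cause is an assumption here, not something the script above confirms, and the helper name `repair_if_wrong_codec` and the default codec `latin-1` are hypothetical choices for illustration. A minimal sketch:

```python
def repair_if_wrong_codec(text, wrong="latin-1", right="utf-8"):
    """Return the repaired string if `text` looks like `right`-encoded bytes
    that were decoded with `wrong`; return None otherwise."""
    try:
        # Undo the suspected wrong decode, then decode with the intended codec.
        repaired = text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The string cannot be explained by this particular codec mix-up.
        return None
    # Plain ASCII round-trips unchanged, so it is not reported as a repair.
    return repaired if repaired != text else None


if __name__ == "__main__":
    # Simulate the suspected failure: UTF-8 bytes decoded as latin-1.
    garbled = "Привет".encode("utf-8").decode("latin-1")
    print(repr(garbled))
    print(repair_if_wrong_codec(garbled))
```

If calling this on `data['text']` from the Sphinx cursor returns readable text, the bytes on the wire are probably fine and the decoding step in the connector is the place to look.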

Sphinx configuration:
common
{
    lemmatizer_base   = /home/www/sphinx_data/dict
}
source src_documents
{
    type                = mysql
    sql_host            = localhost
    sql_user            = *****
    sql_pass            = *****
    sql_db              = *****
    sql_port            = 3306
    sql_query           = SELECT id, text FROM documents

    sql_query_pre       = SET NAMES utf8
    sql_query_pre       = SET CHARACTER SET utf8
    sql_query_pre       = SET CHARACTER_SET_RESULTS=utf8

    sql_field_string    = text

}
index documents
{
    source            = src_documents
    path              = /home/www/sphinx_data/p1
    docinfo           = extern
    charset_type      = utf-8
    morphology        = stem_en, stem_ru, soundex
    min_word_len = 3
    enable_star = 1
    min_infix_len = 3
    wordforms       = /home/www/sphinx_data/dict/words.txt
    charset_table = 0..9, A..Z->a..z, _, a..z, \
        U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+0435, U+451->U+0435, U+02D
}
searchd
{
  listen            = 127.0.0.1:9306:mysql41

  log               = /var/log/sphinxsearch/searchd.log
  query_log         = /var/log/sphinxsearch/query.log
  read_timeout      = 5
  max_children      = 30
  pid_file          = /home/www/sphinx_data/searchd.pid

  max_matches       = 1000
  seamless_rotate   = 1
  preopen_indexes   = 1
  unlink_old        = 1
}

Please point me in the right direction.


2 answer(s)
DamskiyUgodnik, 2019-02-18
@DamskiyUgodnik

It turned out that the cause is in how the database connector behaves. When connecting, you need to specify use_unicode=False; after that everything works fine, provided you convert the data with .decode('utf8') when displaying it.
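
The fix described above could look roughly like this. It is only a sketch: the connection parameters are copied from the question, the helper names `decode_row` and `fetch_first_document` are made up for illustration, and it assumes the connector returns string columns as `bytes` once `use_unicode=False` is set.

```python
def decode_row(row, encoding="utf-8"):
    """Return a copy of a DictCursor row with every bytes value decoded to str."""
    return {
        key: value.decode(encoding) if isinstance(value, (bytes, bytearray)) else value
        for key, value in row.items()
    }


def fetch_first_document():
    # Imported here so decode_row can be used without the driver installed.
    import MySQLdb
    import MySQLdb.cursors

    # use_unicode=False asks the connector to return raw bytes
    # instead of decoding string columns itself.
    sphinx_db = MySQLdb.connect(host="127.0.0.1", port=9306, user="", passwd="",
                                db="", charset="utf8", use_unicode=False)
    try:
        cursor = sphinx_db.cursor(cursorclass=MySQLdb.cursors.DictCursor)
        cursor.execute("SELECT * FROM documents LIMIT 1")
        row = cursor.fetchone()
    finally:
        sphinx_db.close()

    # Decode explicitly as UTF-8 before printing.
    return decode_row(row) if row is not None else None


if __name__ == "__main__":
    data = fetch_first_document()
    if data is not None:
        print(data["text"])
```

Decoding in one place, right after `fetchone()`, keeps the rest of the script working with ordinary `str` values.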

Puma Thailand, 2019-02-17
@opium

Install the latest version; in theory, you can then remove everything related to utf8 from the config.
