Python
beduin01, 2019-02-27 12:31:21

What is the correct way to use the multiprocessing module with PostgreSQL?

As a result, under pressure from the public, I decided to abandon SQLite even for storing links to files and moved everything to PostgreSQL.
Adding threads to the application gave no benefit. As I was advised here, I decided to try the multiprocessing module to spread the work across cores. My code looks like this:

def select_single_file_for_processing():
    # ...
    sql = """UPDATE processing_files SET "isProcessing" = 'TRUE' WHERE "xml_name" = %s"""
    cursor.execute(sql, (xml_name,))  # parameterized instead of str.format
    conn.commit()

def worker():
    result = select_single_file_for_processing()  # get a file to process
    # ...
    # processing()

def main():
    # ...
    while unprocessed_xml_count != 0:

        checker_thread = threading.Thread(target=select_total_unprocessed_xml_count)
        checker_thread.start()  # check whether there is still data to process

        for i in range(10):  # launch the worker processes
            t = Process(target=worker)
            t.start()

Unfortunately, I did not see any performance gain; on the contrary, the multi-process variant runs several times slower than the single-threaded one. Where is the mistake?
Also, how do I organize access to the unprocessed_xml_count variable correctly? Right now every process reads it independently, and I'm sure that is not done correctly.
I also tried a very crude option. I have 1000 files in the database. Without any synchronization, I did this:
for x in range(1000):
    for i in range(3):
        t = Process(target=worker)
        t.start()
        t.join()

The idea was to run the files three processes at a time. But this is also wildly slow.

2 answer(s)
Melkij, 2019-02-27
@melkij

decided to abandon SQLite even for storing file links

What for?
SQLite is an excellent database, and the most widely deployed one in the world (see Android, Firefox, Chrome: SQLite is everywhere).
I can hardly imagine SQLite at terabyte scale, but at a scale of a couple of megabytes PostgreSQL is clear overkill.
On the actual topic: I'm not a Python developer, and from the code given it is completely unclear to me what you are doing. Profile your application and find where the time is actually being spent.
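The profiling Melkij suggests can be done with the standard library's cProfile and pstats modules; a minimal sketch, where slow_step is a hypothetical stand-in for the real per-file processing:

```python
import cProfile
import io
import pstats

def slow_step():
    # Hypothetical stand-in for the real per-file processing.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10):
    slow_step()
profiler.disable()

# Print the five most expensive calls by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumtime").print_stats(5)
print(out.getvalue())
```

The report shows per-function call counts and cumulative times, which is usually enough to tell whether the time goes into the database calls, process startup, or the processing itself.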

Fian, 2019-02-27
@Fian

Each of the processes creates its own connection to the database. Creating a connection is a relatively expensive operation (because of latency) and in this case may take longer than the queries themselves. Creating a new process is also expensive. But frankly, answering this question without profiling is just guessing.
