Efficient multithreading in Python?
I need multithreading in a scripting language. I tried pthreads for PHP 7, but despite the developers' claims, everything works erratically and crashes for no apparent reason. So I'm looking at Python (are there any alternatives?), but here and there people write that multithreading in it is implemented in such a way that multithreaded applications end up slower than ordinary ones, and that there are problems with locks and synchronization. Is that so? My experience with multithreading in Python is small: simple file downloads using the thread module, but as I understand it, there are more convenient and powerful tools now. So the question is: what is the state of multithreading in Python, and which modules are best to use? The tasks are parsing web pages and writing the data into a database. Thank you!
asyncio will solve all the problems (if you master it, of course).
If you want something PHP-like, see, for example,
toly.github.io/blog/2014/02/13/parallelism-in-one-line
but aiohttp will be more fun for parsing.
By the way, Scrapy has already been ported to Python 3, although the developers themselves say it is still a bit raw.
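For illustration, a minimal sketch of concurrent page fetching with asyncio and aiohttp (the URL list and the print step are placeholders, not part of the original answer):
import asyncio
import aiohttp

URLS = ['https://example.com/a', 'https://example.com/b']  # placeholder URLs

async def fetch(session, url):
    # while one request waits on the network, the event loop runs the others
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u) for u in URLS))
        for page in pages:
            print(len(page))  # parse and write to the database here instead

asyncio.run(main())
All the downloads run concurrently in a single thread, which is exactly why this suits network-bound parsing.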
As for multithreading, here is a code example (note that it actually uses processes via the multiprocessing module; extract and GLOBDIR are assumed to be defined elsewhere in the poster's code):
import glob
from multiprocessing import Pool, freeze_support

if __name__ == '__main__':
    freeze_support()  # needed on Windows when the script is frozen into an executable
    pool = Pool(processes=8)
    # extract() runs in 8 worker processes; results are yielded as they complete
    names = pool.imap_unordered(extract, glob.iglob(GLOBDIR), chunksize=1000)
    for name in names:
        print(name)  # consume the results here (e.g. store them)
Well, everyone reads the question so carelessly. The answers above have nothing to do with multithreading. In Python it's better to forget that such a thing as "multithreading" exists; you have chosen the wrong technology for it (there is, of course, PyPy, but I don't know what stage things are at there; there is also the option of using processes, but to me that feels like a crutch). As for the parsing task itself: yes, you can use asynchrony, but only one thread will be used.
It is difficult to make the parser itself parallel.
Making the spider parallel is possible, but there is little point in doing it yourself.
The generally recognized spider framework for Python is scrapy.org.
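For reference, a minimal Scrapy spider could look like this (the spider name, start URL, and CSS selector are illustrative assumptions, not from the answer):
import scrapy

class PagesSpider(scrapy.Spider):
    name = 'pages'  # hypothetical spider name
    start_urls = ['https://example.com/']  # placeholder start page

    def parse(self, response):
        # Scrapy schedules downloads concurrently on its own; no manual threads
        for title in response.css('h1::text').getall():
            yield {'title': title}
Run it with scrapy runspider pages.py; the concurrency is handled by Scrapy's downloader, not by your code.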
When it comes to multithreading, you should first note what type of task you are dealing with.
If we are talking about CPU-bound tasks and you need to load a multi-core processor, then yes, the GIL gets in the way (it effectively forbids parallelism), and in Python you have to write a compiled extension or use several processes.
If the task is IO-bound - and that covers almost everything related to the network, including the web - then, as a rule, the blocking wait for a network response releases the GIL, and you can safely use multithreading.
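A minimal sketch of the IO-bound case with a thread pool (the URLs and the fetch helper are placeholders):
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URLS = ['https://example.com/a', 'https://example.com/b']  # placeholder URLs

def fetch(url):
    # the GIL is released while this thread blocks on the socket
    with urlopen(url) as response:
        return response.read()

with ThreadPoolExecutor(max_workers=8) as executor:
    for page in executor.map(fetch, URLS):
        print(len(page))  # parse and insert into the database here instead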
Another thing is that many people now advise asynchronous frameworks (the same asyncio), which for network tasks give much better performance than native threads (although, it seems, greenlets are even better).
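And, purely as an assumed sketch, the same idea with greenlets via gevent (monkey-patching makes blocking stdlib IO cooperative; the URLs are placeholders):
from gevent import monkey
monkey.patch_all()  # must run before other imports so sockets become cooperative

import gevent
from urllib.request import urlopen

def fetch(url):
    with urlopen(url) as response:
        return len(response.read())

jobs = [gevent.spawn(fetch, u) for u in ['https://example.com/a', 'https://example.com/b']]
gevent.joinall(jobs)
print([job.value for job in jobs])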
Hello, sorry for answering with a question: do you really need to use a scripting language? If yes, then Python is better; there is a library in the standard distribution: multiprocessing (or something like that, I don't remember exactly, it was a long time ago). In fact, it is touchy: its functions must be called from under the main-module guard:
if __name__ == "__main__":
    # call the multiprocessing functions here