Everything is very simple. Here are rough orders of magnitude for data access timings: a main-memory read is on the order of 100 ns, an SSD read around 100 µs, a round trip inside a datacenter around 500 µs, and a request across the internet takes tens to hundreds of milliseconds.
That is, after sending a request over the network, we just sit and wait. We wait a long time and do nothing.
In the case of curl (i.e. HTTP), we can build a queue of requests, fire them all off in one go, and wait until every document in the queue has been downloaded before processing the results. If we want to fetch 10 documents, then without curl_multi it would take us roughly "average time to fetch one document" × 10. With curl_multi, the total time for 10 requests is roughly the time of the slowest single request. If we assume every request takes the same time, we get a speedup of about 10×.
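The same arithmetic can be sketched in Python (PHP's curl_multi behaves analogously). The `fetch` function here is a hypothetical stand-in that just sleeps instead of hitting the network, since the waiting is the whole point:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(i):
    # stand-in for one HTTP request that spends ~0.2 s purely waiting
    time.sleep(0.2)
    return i

start = time.monotonic()
results = [fetch(i) for i in range(10)]         # one after another: ~2 s total
sequential = time.monotonic() - start

start = time.monotonic()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, range(10)))  # all at once: ~0.2 s total
concurrent = time.monotonic() - start
```

Since the threads spend their time waiting rather than computing, the concurrent version finishes in roughly the time of the single slowest request.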
Sockets are more fun. They come in blocking (the default) and non-blocking (enabled with the O_NONBLOCK option) flavors. First, let's define what reading data from a socket is and how the operating system provides it. Simplified: when we create a socket, we simply ask the operating system for one. Each socket has a read buffer and a write buffer. When the write buffer has data, the OS keeps sending until it is empty (the write buffer exists so the OS can verify that packets arrived and resend them if needed; it is also involved in how the OS picks packet sizes, and so on, none of which matters much for this question). When data arrives on the socket, it is first placed in the read buffer, where it sits until our code asks for it. So we can be sure the data will not be lost.
So... let's take a blocking socket and try to request 1024 bytes of data from it while the client isn't sending anything and the read buffer is empty. And suppose this goes on for, say, 10 minutes. The moment we ask for data and the read buffer turns out to be empty, execution blocks until data appears.
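A minimal Python sketch of blocking reads, using a connected socketpair in place of a real client:

```python
import socket

a, b = socket.socketpair()   # a connected pair; b plays the client
b.sendall(b"hello")
data = a.recv(1024)          # read buffer has data: returns immediately

# with an empty read buffer, recv() would block indefinitely;
# here we cap the wait with a timeout just to demonstrate
a.settimeout(0.1)
try:
    a.recv(1024)
    blocked = False
except socket.timeout:
    blocked = True           # the call sat waiting until the timeout fired
```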
Now imagine that we need to periodically check for data not on one socket but on a dozen. Imagine also that 9 of the connected clients are well-behaved and send data on time, while one is not and likes to stall for half an hour. With blocking sockets we can only serve one client at a time. If that client happens to have no data, we are forced to wait, even though data may already be sitting in other sockets. And if the "good" clients cost us half a second to a second each, then as soon as we stumble on the bad one, the server hangs for that same half hour. The server just waits for the "bad" client, the good ones can no longer reach it, and we won't accept new connections either... in short, everything is dead.
This is where the O_NONBLOCK option comes to the rescue. With it, if the socket's read buffer is empty, the call returns immediately without handing us any data, instead of waiting for slow, dim-witted clients. If the buffer is not empty, everything works as with a blocking socket: the call simply returns the buffer's contents along with control. So we can just poll all the sockets one by one in an infinite loop, and the delay in picking up data is minimized.
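The non-blocking behavior can be sketched the same way; `setblocking(False)` is Python's way of setting O_NONBLOCK:

```python
import socket

a, b = socket.socketpair()
a.setblocking(False)         # the O_NONBLOCK equivalent

try:
    a.recv(1024)             # empty read buffer: no waiting, an error instead
    got_early = True
except BlockingIOError:
    got_early = False        # control came straight back to us

b.sendall(b"ping")
got = a.recv(1024)           # buffer non-empty now: contents returned at once
```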
And it seems like everything is fine, except that an endless loop that never blocks pegs a CPU core at 100%. Not good. With blocking calls there is no such load (depending on the task), but then our server responds very slowly. Still, it is not all that bad.
There is also a wonderful facility the operating system provides: select and epoll (in PHP, socket_select and stream_select). These functions let us hand over the sockets we are monitoring (strictly speaking, their descriptors, but that's beside the point; and not one array but three: one watched for data becoming available to read, one for the write buffer draining so the socket is writable again, and one for errors, such as a dropped connection). We can also give the function a timeout, which is very convenient: say, we collect data from several clients, and if nothing has arrived from them for a couple of seconds, we take what we have and start processing.
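In Python the same call is `select.select` (PHP's socket_select / stream_select mirror it): three descriptor lists plus a timeout:

```python
import select
import socket

a, b = socket.socketpair()

# three lists: watch for readability, writability, and errors;
# the last argument is the timeout in seconds
r, w, e = select.select([a], [], [], 0.1)
# nothing has been sent yet, so the call waits 0.1 s and returns r == []

b.sendall(b"data")
r, w, e = select.select([a], [], [], 1.0)
# a is now reported readable, so a recv() on it is guaranteed not to block
```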
By tuning the timeouts and so on, you can get nearly the same minimal latency as with a plain infinite loop, while the load on the system stays not much higher than with ordinary blocking sockets.
But everything above only makes sense for TCP/TLS; with UDP sockets it would be even more fun. There are no connections, no retransmissions, and no blocking on an established stream: there are only packets, and a lost packet is simply gone. That is why this protocol is used (or used as a base) for real-time systems: there are no delays, and if some packet did not arrive, it is most likely no longer relevant anyway. True, if the network is unreliable and packet loss is high, pain and tears begin, and for such cases everything is usually duplicated over TCP.
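A minimal UDP sketch over localhost (where loss is negligible): no connection is established, datagrams are just fired off:

```python
import socket

receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))       # port 0: let the OS pick a free one
addr = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"tick", addr)          # fire and forget: no ACK, no retransmit

data, peer = receiver.recvfrom(1024)  # the whole datagram arrives in one piece
```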
What is multithreading for? To perform some action in parallel.
For example parsing something.
Fetching 100 pages sequentially in 1 thread takes 100 seconds; fetching them in 100 threads takes about 1 second. Is there a difference?