PHP
Eugene, 2017-02-26 15:42:57

Will curl_multi improve performance in my case, and what are the ways to optimize parsing?

Hello,
There is a server with a large number of IP addresses and a small PHP 7 parser script. There are also several online stores with a huge assortment, roughly 15,000 items, where one item = one page. When a visitor opens a product page and there is no fresh data in the cache, a request is sent to the parser; it processes the request and returns the result, which is cached for 30 minutes.
Actually, the parser script itself is straightforward: it gets the product ID, scrapes the required information with curl and returns it. At the moment the parsing server copes with its task, but the thought that its work could somehow be optimized won't leave me. I have seen examples with curl_multi, but they work with an already existing, pre-generated list of URLs over which the work is parallelized. I have also seen an example with pthreads; in my opinion, that is not the right fit either. Dear experts, what other performance optimizations could be applied? :)
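
To make the current setup concrete, here is a minimal sketch of the flow described above. The cache backend (APCu), the parser URL and the "id" parameter are illustrative assumptions only, not details given in the question.

```php
<?php
// Minimal sketch of the described flow, for illustration only.
// The APCu cache, the parser URL and the "id" parameter are assumptions.

const CACHE_TTL = 1800; // 30 minutes

function getProductData(string $productId): ?array
{
    $key = 'product:' . $productId;

    $cached = apcu_fetch($key, $hit);
    if ($hit) {
        return $cached; // fresh data, no request to the parser server
    }

    // Cache miss: ask the parser server for fresh data.
    $ch = curl_init('http://parser.internal/parse.php?id=' . urlencode($productId));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    $body = curl_exec($ch);
    curl_close($ch);

    if ($body === false) {
        return null; // parser unavailable; the caller decides what to show
    }

    $data = json_decode($body, true);
    apcu_store($key, $data, CACHE_TTL);

    return $data;
}
```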



2 answers
Andrey Nikolaev, 2017-02-26
@gromdron

Have you tried doing this not on the user's request (page visit), but with a background script that itself checks whether the data is still fresh and requests new data from the third-party server?
I see the following bottleneck:
Suppose 100 users visit the site at the same time and all hit the same or different pages whose data has already expired. Will you send 100 requests to the external server and then refresh the cache 100 times? And what if the remote service slows down, say to 0.5 seconds per request, and you completely saturate your Internet channel?
Suggested approach:
You fetch the page and store it in the cache for, say, 1 day, while recording an "expired" mark of 30 minutes / 1 hour. A background script checks the relevance of these pages every 30 minutes / 1 hour and, if the page has actually changed (!), updates the data, refreshing the cache and extending the expiry. In this script you can use curl_multi, threads/processes and whatever else you like to speed things up (you could even write it in Go); a minimal sketch is given after this answer. If for some reason the script has not reached a page for a whole day, it sends a notification to the administrator, and the page is pulled manually as in the previous case (and still saved to the cache).
That way, no matter how many users visit the page, they always see up-to-date data, and in the worst case you fall back to the last cached version and then figure out what went wrong (ideally the situation with a crashed background script should never occur, but if it does, there is a fallback).
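
Here is a minimal sketch of such a background refresher, assuming storeInCache() is your own helper and the URL list is whatever your cache layer reports as close to expiry; run it from cron every 30-60 minutes.

```php
<?php
// Background refresher sketch using curl_multi. storeInCache() is a
// hypothetical helper; $urls is the list of pages whose data is stale.

function refreshExpired(array $urls, int $concurrency = 10): void
{
    foreach (array_chunk($urls, $concurrency) as $batch) {
        $mh = curl_multi_init();
        $handles = [];

        foreach ($batch as $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 10);
            curl_multi_add_handle($mh, $ch);
            $handles[$url] = $ch;
        }

        // Drive all transfers in the batch in parallel until they finish.
        do {
            $status = curl_multi_exec($mh, $running);
            if ($running) {
                curl_multi_select($mh);
            }
        } while ($running && $status === CURLM_OK);

        foreach ($handles as $url => $ch) {
            $html = curl_multi_getcontent($ch);
            if (is_string($html) && $html !== '') {
                storeInCache($url, $html, 86400); // cache for 1 day, as suggested
            }
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }

        curl_multi_close($mh);
    }
}
```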

xmoonlight, 2017-02-26
@xmoonlight

Dear experts, what other methods of performance optimization can be carried out?

1. You can move parsing into a background mode driven by prediction, i.e. if a product with certain characteristics has just been requested, then the links for that product across all stores are pushed to the top of the queue.
2. Use a circular order of domains (round-robin) when processing URLs from the queue, to increase the intervals between requests to the same site and thereby reduce the risk of a ban (a rough sketch of this, combined with point 4, follows after the list).
3. Create an API for your service to which clients connect, and let clients fetch information about product offers from other sites directly, bypassing your server (just state this explicitly!). In other words, turn your customers into price consolidators and distribute the parsing load among them. When a client collects information on a specific product from several URLs at once, have that information sent back to your server automatically via the API.
4. (everything below applies to a single domain that needs to be parsed!)
How often you can parse a particular store depends only on your channel bandwidth, the computing power on your side, and a REASONABLE interval between successive requests to that (single) store, i.e. with round-robin make sure requests go out no more often than once every 5-10 minutes.
"batch": 100 requests with a period of 2 minutes and a pause of 3 hours,
"period": 100 requests for 3 hours, and again "batch",
etc. . in a circle (or in a random sequence).
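
A rough sketch of points 2 and 4 taken together: interleave URLs from different stores (round-robin) and throttle so the same store is not hit more often than a chosen interval. fetchPage() and the 300-second interval are illustrative assumptions, not part of the original advice.

```php
<?php
// Round-robin over domains (point 2) plus per-domain pacing (point 4).
// fetchPage() is a hypothetical curl-based helper; 300 seconds is an
// example value for the minimum interval per store.

function roundRobinOrder(array $urlsByDomain): array
{
    $ordered = [];
    while (array_filter($urlsByDomain)) {
        // Take one URL from each store in turn before coming back around.
        foreach (array_keys($urlsByDomain) as $domain) {
            if ($urlsByDomain[$domain]) {
                $ordered[] = array_shift($urlsByDomain[$domain]);
            }
        }
    }
    return $ordered;
}

function crawlQueue(array $orderedUrls, int $minInterval = 300): void
{
    $lastHit = []; // domain => timestamp of the previous request

    foreach ($orderedUrls as $url) {
        $domain = parse_url($url, PHP_URL_HOST);
        $wait = ($lastHit[$domain] ?? 0) + $minInterval - time();
        if ($wait > 0) {
            sleep($wait); // keep at least $minInterval seconds between hits to one store
        }
        fetchPage($url); // hypothetical: your existing curl-based fetch + parse
        $lastHit[$domain] = time();
    }
}
```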
