How to do "background" parsing?
Hello. I use "simple html dom" (PHP) for parsing, and in the future I will need to parse about 10 sites. Right now the parsing runs while my page is loading, so it takes quite a long time before all the information is fetched and the page is displayed. With 1-2 sites it is still tolerable, but with more it will get bad. I thought it would be a good idea to parse the data every 10-20 minutes instead (the information I need is updated frequently) and save it all to the database.
Can you please tell me the best way to implement this? Is "cron" suitable for this task, or are there more "correct" methods?
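For now I imagine simply running the script from cron, something like this (the script path and log file are just placeholders):

*/15 * * * * /usr/bin/php /var/www/parser/update.php >> /var/log/parser.log 2>&1

But I am not sure whether this is the "right" approach.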
For starting the parser automatically, cron is suitable. But as I understand it, the essence of the question is different: "how do I parse XX sites quickly?" Do it in parallel — mCurl (PHP's curl_multi functions) to the rescue.
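Roughly, the parallel download part might look like this — a minimal sketch only, with placeholder URLs and timeouts; adapt the cURL options to your sites:

<?php
// Minimal parallel-download sketch with PHP's curl_multi API.
// The URLs below are placeholders.
$urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
];

$mh = curl_multi_init();
$handles = [];

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// Drive all transfers until every handle has finished.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);  // wait for network activity instead of busy-looping
    }
} while ($running && $status === CURLM_OK);

foreach ($handles as $ch) {
    $html = curl_multi_getcontent($ch);  // page body, ready for DOM parsing
    // ... feed $html into simple_html_dom / DOMDocument here ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

All pages are downloaded concurrently, so the total time is roughly the time of the slowest request rather than the sum of all of them.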
To give you a sense of the speed, here is the scheme I use. I have almost 42,000 URLs to check. Before the run starts, they are pushed into Redis as a stack (so that the parallel workers never download the same page twice). Then 10 PHP scripts are launched from cron via bash, each taking 100 addresses at a time, downloading them, parsing the data out of the page through the DOM, and writing the results to the database. So besides the page downloads there are also slow operations: building the DOM and writing to the RDBMS. The whole run takes less than 20 minutes, i.e. a minimum rate of about 30 pages/sec.
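One such worker might look roughly like this — a sketch only, assuming the phpredis extension and PDO; the Redis key urls:queue, the pages table, and the connection details are all made up for the example:

<?php
// One worker in the scheme above: pop a batch of URLs from a Redis list
// (the shared stack), download and parse each page, store the result.
// Launched from cron/bash, e.g.: */20 * * * * /usr/bin/php /path/to/worker.php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$pdo = new PDO('mysql:host=localhost;dbname=parser', 'user', 'pass');
$insert = $pdo->prepare('INSERT INTO pages (url, title, fetched_at) VALUES (?, ?, NOW())');

$batch = [];
for ($i = 0; $i < 100; $i++) {             // each worker takes 100 addresses at a time
    $url = $redis->lPop('urls:queue');     // atomic pop, so workers never share a URL
    if ($url === false) {
        break;                             // the stack is empty, nothing left to do
    }
    $batch[] = $url;
}

foreach ($batch as $url) {
    $html = file_get_contents($url);       // or the curl_multi variant shown above
    if ($html === false) {
        continue;
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($html);                // suppress warnings from real-world markup
    $node  = $doc->getElementsByTagName('title')->item(0);
    $title = $node ? trim($node->textContent) : '';
    $insert->execute([$url, $title]);
}

Because every worker pops from the same Redis list, you can add more parallel workers simply by adding more cron entries, without changing the code.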