What is the best way to organize parsing, and what restrictions do target sites impose?
I'm building a news service. The idea is to automatically parse news from any source that users need.
I currently have 48 sources. Parsing works like this: once a minute, the server runs a PHP script that takes the "next" source from the database and parses it. As a result, each source is checked only once every 48 minutes.
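To make the arithmetic explicit, here is a minimal Python sketch of how the revisit interval grows under this round-robin scheme. The function name and the `sources_per_tick` parameter are hypothetical, introduced only for illustration; the real setup is a PHP script that handles one source per minute.

```python
import math

def revisit_interval_minutes(num_sources, sources_per_tick=1, tick_minutes=1.0):
    """Minutes between two visits to the same source under plain round-robin.

    Hypothetical helper: assumes every tick handles `sources_per_tick`
    sources and that all sources are visited in a fixed cycle.
    """
    ticks_per_cycle = math.ceil(num_sources / sources_per_tick)
    return ticks_per_cycle * tick_minutes

# Current setup: 48 sources, one source per minute.
print(revisit_interval_minutes(48))                       # 48.0
# The same scheme with a thousand sources.
print(revisit_interval_minutes(1000))                     # 1000.0
# Handling 8 sources per tick shortens the cycle.
print(revisit_interval_minutes(48, sources_per_tick=8))   # 6.0
```

The interval grows linearly with the number of sources, which is why one source per minute stops being workable long before the system reaches thousands of sites.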
Sites are grouped by region. In the future, a single region may contain hundreds of sites, and the system as a whole may contain thousands. I don't know whether the project will grow to that scale, but the current parsing method is already inconvenient.
News needs to be fetched as quickly as possible. First, it is a matter of relevance: the site's target audience is journalists. Second, a single site may publish several news items within an hour. Then, when the "latest" news is displayed, the user sees several items in a row from the same section, as if there were no others.
Question: how should the parsing be organized?
Check every site every 1, 2, or 5 minutes?
But that would eventually put a heavy load on the server. Could such activity also get my server blocked by the target site? Maybe there are other restrictions as well.
Split the sources into categories? But that can still take a long time if a category contains many sites.
In general, I don't know which approach is needed here.
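One possible way to frame the trade-off between freshness and load is a priority queue in which each source has its own polling interval. The Python sketch below is an illustration under stated assumptions, not a recommended implementation. The `Source` class, the `MIN_INTERVAL` and `MAX_INTERVAL` bounds, and the halve-or-grow rule are all hypothetical choices, and the fetching and parsing themselves are left out.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical bounds (seconds): never poll a site more often than once
# a minute, and never less often than once an hour.
MIN_INTERVAL = 60.0
MAX_INTERVAL = 3600.0

@dataclass(order=True)
class Source:
    next_due: float                      # the only field used for heap ordering
    url: str = field(compare=False)
    interval: float = field(compare=False, default=300.0)

def pop_due(queue, now, limit):
    """Remove and return up to `limit` sources whose check time has arrived."""
    due = []
    while queue and queue[0].next_due <= now and len(due) < limit:
        due.append(heapq.heappop(queue))
    return due

def reschedule(queue, src, now, found_new):
    """Poll active sources more often and quiet ones less often."""
    if found_new:
        src.interval = max(MIN_INTERVAL, src.interval / 2)
    else:
        src.interval = min(MAX_INTERVAL, src.interval * 1.5)
    src.next_due = now + src.interval
    heapq.heappush(queue, src)

# Example: three placeholder sources, two of them already due at t = 10.
queue = [Source(0.0, "https://a.example"),
         Source(100.0, "https://b.example"),
         Source(0.0, "https://c.example")]
heapq.heapify(queue)
batch = pop_due(queue, now=10.0, limit=5)
for src in batch:
    # A real worker would fetch and parse src.url here; the sketch
    # simply pretends that only a.example had fresh news.
    reschedule(queue, src, now=10.0, found_new=src.url.endswith("a.example"))
```

In this sketch, `limit` caps how much work one tick does, which bounds the load on the parsing server, while `MIN_INTERVAL` bounds how often any single target site is hit. Whether a given site tolerates that rate is still an open question that depends on the site.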