Parsing
John Freeman, 2019-08-29 01:47:51

How to delay parsing in node.js?

Good day!
The gist of the question: I have, say, 12,000 pages of one site. With ordinary parsing using request and cheerio, the site crashes with an error after a few pages. How can I add a delay and parse the content of the 12,000 pages from a file sequentially?
Thanks in advance!


1 answer(s)
Kovalsky, 2019-09-02
@lazalu68

What do you mean by "the site crashes"? Do you mean it stops returning a valid response?
To answer the question: at one point, to get around various anti-scraping protections, I split the parsing procedure into several levels: parsing a single element, a bunch (a group of elements), and a session (a group of bunches).
In the code, a bunch differed from a session only in the number of elements processed. The parsing algorithm came out roughly like this:

Process element #i
  If an error occurs:
    wait SINGLE_REQUEST_TIMEOUT
    try again
  i++
  If i modulo ITEMS_IN_SESSION is zero:
    wait SESSION_TIMEOUT
  Else if i modulo ITEMS_IN_BUNCH is zero:
    wait BUNCH_TIMEOUT
  Else:
    wait SINGLE_REQUEST_TIMEOUT

I hope that is clear. The timeouts cover intervals of different scales, from seconds and minutes up to hours and days. For example, SINGLE_REQUEST_TIMEOUT could be 1000 ms, BUNCH_TIMEOUT could be 30000 ms, and SESSION_TIMEOUT could be several hours or even a day. With this approach I have not run into parsing problems so far.
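
The scheme above could look roughly like this in Node.js. This is only a sketch under assumptions, not the code I actually used: the names crawlSequentially, pauseAfter, processPage and MAX_RETRIES are made up for illustration, the ITEMS_IN_BUNCH and ITEMS_IN_SESSION values are arbitrary, and SESSION_TIMEOUT is set to one hour just to have a number. It checks the session boundary before the bunch boundary, on the assumption that ITEMS_IN_SESSION is a multiple of ITEMS_IN_BUNCH.

```javascript
// Timeouts in milliseconds. The first two values are the examples from the
// answer; everything else here is an assumption made for this sketch.
const DEFAULTS = {
  SINGLE_REQUEST_TIMEOUT: 1000,
  BUNCH_TIMEOUT: 30000,
  SESSION_TIMEOUT: 60 * 60 * 1000, // assumed: one hour
  ITEMS_IN_BUNCH: 50,              // assumed
  ITEMS_IN_SESSION: 1000,          // assumed: a multiple of ITEMS_IN_BUNCH
  MAX_RETRIES: 3,                  // assumed: the answer does not cap retries
};

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Choose the pause after `done` items have been processed.
// The session check comes first because, with the assumed sizes,
// every session boundary is also a bunch boundary.
function pauseAfter(done, cfg) {
  if (done % cfg.ITEMS_IN_SESSION === 0) return cfg.SESSION_TIMEOUT;
  if (done % cfg.ITEMS_IN_BUNCH === 0) return cfg.BUNCH_TIMEOUT;
  return cfg.SINGLE_REQUEST_TIMEOUT;
}

// Process the URLs strictly one after another.
// `processPage` is a hypothetical async callback that downloads and parses
// a single page; `wait` can be swapped out for a fake in tests.
async function crawlSequentially(urls, processPage, options = {}, wait = sleep) {
  const cfg = { ...DEFAULTS, ...options };
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    let attempt = 0;
    for (;;) {
      try {
        results.push({ url: urls[i], value: await processPage(urls[i]) });
        break;
      } catch (error) {
        attempt++;
        if (attempt > cfg.MAX_RETRIES) {
          // Give up on this page, record the error and move on.
          results.push({ url: urls[i], error });
          break;
        }
        await wait(cfg.SINGLE_REQUEST_TIMEOUT);
      }
    }
    // No pause is needed after the very last page.
    if (i + 1 < urls.length) await wait(pauseAfter(i + 1, cfg));
  }
  return results;
}
```

For the question as asked, processPage would be the place where a page is downloaded (with request or any other HTTP client) and handed to cheerio, and urls would be the list read from the file. Because each page is awaited before the next one starts, there is never more than one request in flight.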
