Parsing
beduin01, 2015-03-01 20:51:47

Does it make sense to write a multi-threaded parser?

I am writing a parser for a large site and have a question: should it be multi-threaded?
1. Won't too many threads get the parser's IP banned? Can this be checked experimentally?
2. Will parsing speed increase in proportion to the number of threads, or is it not that simple?
3. What restrictions might I run into that would keep the speed from increasing?

2 answers
Gluck Virtualen, 2015-03-01
@gluck59

I built an eBay parser, first single-threaded, then with multicurl. Speed increased about 15-fold.
To avoid getting banned, I collected a bunch of different user agents and fed multicurl a random one each time.
Obviously the IP was still the same, but nobody knows exactly what anti-bot mechanisms are in place there (and they are there). The result: we weren't banned for a year :)
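
A minimal sketch of the same idea in Python (the answer used PHP's multicurl, i.e. curl_multi; here concurrent.futures and the requests library stand in for it, and the URL list and user-agent pool are made up for illustration):

```python
import random
from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical user-agent pool; in practice, collect real browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def fetch(url):
    # Feed each request a randomly chosen user agent, as in the answer.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    resp = requests.get(url, headers=headers, timeout=10)
    return url, resp.status_code, resp.text

# example.com URLs are placeholders.
urls = ["https://example.com/item/%d" % i for i in range(1, 21)]

# A handful of worker threads; past some point more threads stop helping
# (see the second answer for where the limits come from).
with ThreadPoolExecutor(max_workers=8) as pool:
    for url, status, body in pool.map(fetch, urls):
        print(url, status, len(body))
```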

Alexey Sundukov, 2015-04-16
@alekciy

1. Of course it can. Before scraping, you can always read robots.txt for the Crawl-delay/Request-rate directives. Real limits, however, only emerge during operation (watch for HTTP statuses other than 200). Ideally, accumulate statistics as you go and adjust the crawl rate dynamically; see the sketch after this list.
2. No, it is not proportional, as with parallelism in general. How sharply efficiency drops depends, of course, on the application's architecture.
3. Various kinds of blocking (disk/network I/O, database writes, etc.), OS limits (number of open ports, disk I/O limits), and slow responses from the resource being scraped.
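
A rough sketch of point 1 in Python, using the standard urllib.robotparser to read the robots.txt directives; the backoff rule (double the delay on any non-200 response, creep back otherwise) is a hypothetical stand-in for the dynamic adjustment the answer describes:

```python
import time
import urllib.robotparser

import requests

SITE = "https://example.com"  # hypothetical target site

# Read Crawl-delay / Request-rate from robots.txt before starting.
rp = urllib.robotparser.RobotFileParser(SITE + "/robots.txt")
rp.read()
base_delay = rp.crawl_delay("*") or 1.0  # fall back to 1 s if unspecified
delay = base_delay

def polite_get(url):
    global delay
    resp = requests.get(url, timeout=10)
    if resp.status_code != 200:
        # A non-200 status suggests we are going too fast: back off.
        delay = min(delay * 2, 60)
    else:
        # Drift back toward the baseline while responses look healthy.
        delay = max(delay * 0.9, base_delay)
    time.sleep(delay)
    return resp

for i in range(1, 11):
    r = polite_get(SITE + "/page/%d" % i)
    print(r.url, r.status_code, "next delay %.1fs" % delay)
```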
