Python
Bjornie, 2016-11-14 00:26:36

Precautions when parsing in Python?

Recently I have been learning Python, and as my first project I want to parse data from a members-only area of a site (behind authorization).
I worked through one lesson (a gist, with a link to a YouTube video as well), and everything in it is quite clear. But the author does not use any modules for authorization, does not send headers, does not use a proxy, etc., so the following questions arise:
- If several thousand pages have to be parsed, what precautions should be taken to avoid getting banned?
- Presumably, if you put pauses between requests, you can avoid a ban? (And how do you "scout" the situation in general, to understand: this site can be parsed safely, while that one will show a complex captcha after the first 3 requests?)
- Is it worth parsing from a desktop machine (as the author did)?
- What simple HTTP client can you recommend?
- Is it enough to send the same headers my own browser sends?
The data to parse is simple in general: names, cities and contacts; no JS, with pagination.


5 answers
el777, 2016-11-17
@Bjornie

If you are serious about scraping, I recommend taking a look at scrapy, an excellent Python framework for web scraping.
The problem in the title can be solved without writing garbage code.
Bottom line: one page of clean code downloads 345 pages from weblancer in 57 seconds across 16 threads and yields 3420 projects.
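A minimal sketch of what such a spider might look like. This is not el777's actual code: the start URL, CSS selectors and settings are assumptions for illustration and would need adjusting to the real markup.

```python
import scrapy


class ProjectsSpider(scrapy.Spider):
    """Sketch of a scrapy spider; selectors below are hypothetical."""
    name = "projects"
    start_urls = ["https://www.weblancer.net/jobs/"]  # assumed entry point
    custom_settings = {
        "CONCURRENT_REQUESTS": 16,  # the 16 threads mentioned above
        "DOWNLOAD_DELAY": 0.25,     # small delay to stay polite
    }

    def parse(self, response):
        # Yield one item per project card (hypothetical markup).
        for project in response.css("div.row_job"):
            yield {
                "title": project.css("a.title::text").get(),
                "url": response.urljoin(project.css("a.title::attr(href)").get()),
            }
        # Follow pagination until it runs out.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Run it with `scrapy runspider projects_spider.py -o projects.json`; concurrency, retries and throttling are all handled by settings instead of hand-rolled code.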

Optimus Pyan, 2016-11-14
@marrk2

Sometimes it's easier to do the opposite: set up the parser with 10 threads and scrape everything in 30 minutes, before the admins come to their senses, rather than stretch it out for who knows how long))
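For illustration, a rough sketch of that "grab everything fast" approach with a thread pool; the site and page count are made up:

```python
import concurrent.futures

import requests

# Placeholder URL list; substitute the real paginated catalog.
urls = [f"https://example.com/catalog?page={n}" for n in range(1, 101)]


def fetch(url):
    # One worker: download a page and return its HTML.
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return url, resp.text


# 10 threads, as in the answer above.
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    for url, html in pool.map(fetch, urls):
        print(url, len(html))
```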

Dimonchik, 2016-11-14
@dimonchik2013

A good test is to run wget: it is single-threaded, so if it manages to download the entire site, there is no real protection there.
One trick is to pretend to be Googlebot; take my word for it, very few sites actually verify the bot, especially if you parse from a VPS in the USA.
For VK and the like, where spammers reign, there will always be protection; you will have to probe for its limits.
For headers, see https://pypi.python.org/pypi/fake-useragent/0.1.2
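A small sketch of sending such headers with requests. fake-useragent is the library linked above; the Googlebot string in the comment is Google's published one, and example.com is a placeholder:

```python
import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {"User-Agent": ua.random}  # a random real-browser User-Agent
# Or pose as Googlebot instead:
# headers = {"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
#            "+http://www.google.com/bot.html)"}

resp = requests.get("https://example.com/", headers=headers, timeout=10)
print(resp.status_code, resp.request.headers["User-Agent"])
```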

iSergios, 2016-11-14
@iSergios

1) Mind your timing, and use different IPs and accounts (if possible); see the sketch after this list.
2) Probably yes. But no one here will give you a definite answer; it is all very site-specific. Reconnaissance is always done by trial and error.
3) Yes, go ahead. Nobody goes to jail for parsing; in the worst case you get banned. You decide.
4) What do you mean?
5) How could we know?
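As referenced in point 1, a minimal sketch of randomized pauses between requests (the URL is a placeholder):

```python
import random
import time

import requests

for page in range(1, 11):
    requests.get(f"https://example.com/list?page={page}", timeout=10)
    # A jittered pause looks less mechanical than a fixed interval.
    time.sleep(random.uniform(2, 7))
```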

lcd1232, 2016-11-17
@lcd1232

For a test, those libraries will do, but if you really want to parse large sites, you need to use scrapy.

- If several thousand pages have to be parsed, what precautions should be taken to avoid getting banned?
If there is no authorization, you can use user-agent rotation, proxy rotation and random delays (a combined sketch follows this list).
- Presumably, if you put pauses between requests, you can avoid a ban? (And how do you "scout" the situation in general?)
Just write the parser without pauses first; if everything gets parsed, there is no protection. In my experience, very few sites have protection against frequent requests, mostly the big projects.
- Is it worth parsing from a desktop machine (as the author did)?
Certainly.
- Is it enough to send the same headers my own browser sends?
You have to look at the protection in each case; usually the user-agent is enough.
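A sketch combining the three measures named above: user-agent rotation, proxy rotation and random delays. The user-agent strings, proxy addresses and URL are placeholders:

```python
import itertools
import random
import time

import requests

# Placeholder pools; fill with real values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0",
]
PROXIES = itertools.cycle([
    "http://10.0.0.1:3128",
    "http://10.0.0.2:3128",
])


def get(url):
    # Rotate proxy and user-agent on every request, then pause.
    proxy = next(PROXIES)
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    resp = requests.get(
        url,
        headers=headers,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
    time.sleep(random.uniform(1, 5))
    return resp


for n in range(1, 6):
    print(get(f"https://example.com/catalog?page={n}").status_code)
```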
