How to counteract site scraping?
There is a popular real estate site with lots of house photos and descriptions. Recently someone has been scraping it, and the traffic is eating into the hosting provider's limits. How can scraping be resisted? Which direction should I dig in?
The logs are full of IP addresses, and it is not clear which ones to block.
To scrape a site, the parser needs some kind of stable template: how to locate the data and by what criteria. So one option is to complicate that task: show a captcha or ask the client for some kind of confirmation, break tags where it can be done painlessly (leave them unclosed), and render the same information in different ways (a sketch of that last idea follows). In general, add headaches for the parser writers :) As soon as you notice someone scraping, change the page markup a little.
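A minimal sketch, assuming server-side templating in Python, of "render the same information in different ways": randomize throwaway class names and wrapper tags on every request so a scraper cannot rely on stable selectors. The function names and sample values are made up for illustration.

    import random
    import string

    def random_class() -> str:
        # throwaway class name; a real site would pair these with generated CSS
        return "c" + "".join(random.choices(string.ascii_lowercase, k=8))

    def render_listing(title: str, price: str) -> str:
        # same data, different markup on every request
        tag = random.choice(["div", "span", "section"])
        return (
            f'<{tag} class="{random_class()}">'
            f'<p class="{random_class()}">{title}</p>'
            f'<p class="{random_class()}">{price}</p>'
            f'</{tag}>'
        )

    print(render_listing("3-room house", "120 000"))

The trade-off: this also breaks your own CSS and caching unless the class names are generated together with the stylesheet.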
Move to different hosting with unmetered traffic. And preferably not shared.
Set up a robots.txt file; maybe it is a well-behaved bot and the load will go away immediately. A minimal example is sketched below.
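A minimal robots.txt sketch for that case. Crawl-delay is non-standard: some crawlers (Bing, for instance) honor it, while Googlebot ignores it; the bot name in the second block is hypothetical.

    User-agent: *
    # non-standard, but polite bots slow down when they see it
    Crawl-delay: 10

    # block a specific bot by the name it declares (hypothetical name)
    User-agent: SomeScraperBot
    Disallow: /

This only helps against crawlers that actually read robots.txt; a deliberate scraper will ignore it.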
No way, for the most part. The only real measure is to block IPs that make too many requests per time period, for example with nginx's limit_req (the ngx_http_limit_req_module); a sketch follows.
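A minimal nginx sketch of that approach, assuming nginx sits in front of the site; the rate and burst values are placeholders to tune, and keep in mind that per-IP limits can also hit legitimate users behind a shared NAT.

    http {
        # one bucket per client IP; 10 MB of shared memory holds
        # on the order of 160k addresses
        limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

        server {
            location / {
                # absorb short bursts, reject sustained floods with 429
                limit_req zone=perip burst=20 nodelay;
                limit_req_status 429;
            }
        }
    }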
You have not yet seen how search engines hammer sites: they pull several pages in parallel from several IPs at once, and on shared hosting the load limits kick in immediately and the account gets blocked.
If it is the pictures that are being pulled, that is called hot-linking; a referer-based block is sketched below.
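A minimal nginx sketch of referer-based hotlink protection; the extension list and the domain are placeholders for your own.

    location ~* \.(jpe?g|png|gif|webp)$ {
        # allow empty/suppressed referers (direct visits, some proxies)
        # and requests coming from our own pages
        valid_referers none blocked server_names *.example.com;
        if ($invalid_referer) {
            return 403;
        }
    }

Note that a scraper sending a forged Referer header will still get through; this mainly stops other sites embedding your images.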
If it is the text that is being scraped, then only behavioral filtering is left: weed clients out by their headers and by IP / reverse DNS.
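One way to begin header-based filtering in nginx: flag User-Agent strings that look like scripts. This is trivially spoofed, so treat it as a first sieve only; the pattern list is illustrative.

    # http context: flag clients whose User-Agent looks scripted
    map $http_user_agent $is_script {
        default 0;
        ~*(curl|wget|python-requests|scrapy|libwww) 1;
    }

    server {
        if ($is_script) {
            return 403;
        }
    }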