How to counteract site scraping?
There is a popular real estate site with lots of house photos and descriptions. Recently someone has been scraping it, and the traffic is eating into the hosting provider's limits. How can scraping be resisted? Which direction should I dig in?
The logs are full of IP addresses, and it is not clear which ones to block.
To scrape a site, the parser needs some kind of stable template: how to locate the data and by what criteria. So one option is to complicate that task: show a captcha or ask the client for some kind of confirmation, break tags where it can be done painlessly (leave them unclosed), and render the same information in different ways (a sketch of that last idea follows). In general, add headaches for the parser writers :) As soon as you notice someone scraping, change the page markup a little.
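A minimal sketch, assuming server-side templating in Python, of "render the same information in different ways": randomize throwaway class names and wrapper tags on every request so a scraper cannot rely on stable selectors. The function names and sample values are made up for illustration.

    import random
    import string

    def random_class() -> str:
        # throwaway class name; a real site would pair these with generated CSS
        return "c" + "".join(random.choices(string.ascii_lowercase, k=8))

    def render_listing(title: str, price: str) -> str:
        # same data, different markup on every request
        tag = random.choice(["div", "span", "section"])
        return (
            f'<{tag} class="{random_class()}">'
            f'<p class="{random_class()}">{title}</p>'
            f'<p class="{random_class()}">{price}</p>'
            f'</{tag}>'
        )

    print(render_listing("3-room house", "120 000"))

The trade-off: this also breaks your own CSS and caching unless the class names are generated together with the stylesheet.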
Move to different hosting with unmetered traffic. And preferably not shared.
Set up a robots.txt file; maybe it is a well-behaved bot and the load will go away immediately. A minimal example is sketched below.
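A minimal robots.txt sketch for that case. Crawl-delay is non-standard: some crawlers (Bing, for instance) honor it, while Googlebot ignores it; the bot name in the second block is hypothetical.

    User-agent: *
    # non-standard, but polite bots slow down when they see it
    Crawl-delay: 10

    # block a specific bot by the name it declares (hypothetical name)
    User-agent: SomeScraperBot
    Disallow: /

This only helps against crawlers that actually read robots.txt; a deliberate scraper will ignore it.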
No way, for the most part. The only real measure is to block IPs that make too many requests per time period, for example with nginx's limit_req (the ngx_http_limit_req_module); a sketch follows.
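A minimal nginx sketch of that approach, assuming nginx sits in front of the site; the rate and burst values are placeholders to tune, and keep in mind that per-IP limits can also hit legitimate users behind a shared NAT.

    http {
        # one bucket per client IP; 10 MB of shared memory holds
        # on the order of 160k addresses
        limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

        server {
            location / {
                # absorb short bursts, reject sustained floods with 429
                limit_req zone=perip burst=20 nodelay;
                limit_req_status 429;
            }
        }
    }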
You have not yet seen how search engines hammer sites: they pull several pages in parallel from several IPs at once, and on shared hosting the load limits kick in immediately and the account gets blocked.
If it is the pictures that are being pulled, that is called hot-linking; a referer-based block is sketched below.
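A minimal nginx sketch of referer-based hotlink protection; the extension list and the domain are placeholders for your own.

    location ~* \.(jpe?g|png|gif|webp)$ {
        # allow empty/suppressed referers (direct visits, some proxies)
        # and requests coming from our own pages
        valid_referers none blocked server_names *.example.com;
        if ($invalid_referer) {
            return 403;
        }
    }

Note that a scraper sending a forged Referer header will still get through; this mainly stops other sites embedding your images.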
If it is the text that is being scraped, then only behavioral filtering is left: weed clients out by their headers and by IP / reverse DNS.
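One way to begin header-based filtering in nginx: flag User-Agent strings that look like scripts. This is trivially spoofed, so treat it as a first sieve only; the pattern list is illustrative.

    # http context: flag clients whose User-Agent looks scripted
    map $http_user_agent $is_script {
        default 0;
        ~*(curl|wget|python-requests|scrapy|libwww) 1;
    }

    server {
        if ($is_script) {
            return 403;
        }
    }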