PHP
AdNahim, 2016-08-12 10:16:25

What to do about parsers that hammer the site?

So I've run into aggressive parsing of my site (or something like it).
Here's the situation. The project is becoming quite useful and well visited. There is a button on the page; clicking it loads a table with fairly useful information.
I implemented it like this:
The button has data-url="/id/145", and on click the JS loads the result of requesting that page.
Statistics showed it was clicked roughly once every 15-20 minutes, which is proportional to traffic. Today I saw an ugly picture: 5-10 requests per minute from different IPs (15-20 of them rotating).
Fine, I thought: most likely they simply took site.com/id/145/ and iterate over the id. I added a small check: when someone visits the site, a hash is generated, and a request to /id/ verifies whether the user actually clicked the button or went directly to /id/145/. But nothing changed :( So the button click is somehow being imitated.
What can be done? :) I don't mind losing the information; it just badly pollutes the real statistics, and it doesn't feel right.
Thanks.


4 answer(s)
Rou1997, 2016-08-12
@alexanderkx

Look at the HTTP headers and use them, together with JS, to determine whether it is really a browser or just raw HTTP. If the former, the protection belongs on the front end; if the latter, on the back end. The essence of the protection is to keep making changes the parser cannot process. But roughly speaking, if they really want your data, they will "repair" their bot every day if need be, so you might as well leave them alone.
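As a rough back-end sketch of the header heuristic described above (the header list and the threshold of 3 are my assumptions, not a proven rule; on Apache/FPM the real headers would come from getallheaders()):

```php
<?php
// Score a request by headers a real browser normally sends.
// Raw curl / file_get_contents requests usually omit most of them.
function looksLikeBrowser(array $headers): bool
{
    $expected = ['User-Agent', 'Accept', 'Accept-Language', 'Accept-Encoding'];
    $score = 0;
    foreach ($expected as $h) {
        if (!empty($headers[$h])) {
            $score++;
        }
    }
    // Default curl sends only User-Agent and Accept, so require at least 3.
    return $score >= 3;
}
```

Call it as `looksLikeBrowser(getallheaders())` in a front controller; remember the commenter's caveat that any header set can be faked, so this only filters the lazy bots.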

Vitaly, 2017-03-21
@vshvydky

Guys, you're all giving him bad advice. As someone who does some parsing myself, I know there is no reliable way to protect a site. There just isn't, period.
My recommendations:
1. Don't bother with reCAPTCHA, CAPTCHA, etc. You will annoy real users, while for 5-10 kopecks the bot will get your captcha solved. This protection does not work.
2. If you see a bot pretending to be a browser, rejoice and show it ads. You'll earn money on it.
3. All parsers rely on CSS selectors to extract data. You can build those selectors dynamically and change them as often as possible. Perhaps this will annoy your offenders.
4. You can limit responses to N per user per minute / hour / 2 hours / day, then build a blocklist and ban offenders.
5. You can go through all the budget proxy sites where proxy addresses are cheap to buy, grab their lists, and ban them outright. Most scraping is done through cheap junk proxies.
6. Run client addresses through DNS blocklists.
7. Check whether client addresses resolve to hosting data centers where VPS/VDS servers live. Proxies are most often sold from hosting providers' IP ranges, which makes them easy to identify.
8. Perhaps the most useful tool: embed a fingerprint in the tokens used for registration/authorization. Then even if different proxy addresses are used, the fingerprint stays the same. Too many requests, one and the same fingerprint, rotating addresses routed through hosting ranges: a straight path to the blacklist.
9. Fighting curl lovers: I realize this is still a pain, since almost all the shamanism can be unraveled and replayed as a POST request, but here is how I would do it: a token with a short lifetime, 10-15 minutes, with a fingerprint inside, and to get the token the client must first pass a JS ritual in the browser; if that doesn't happen, don't issue the token. What checks: a fingerprint from applying effects to an image and taking its MD5 hash, collecting the installed fonts, collecting plugin and MIME-type data from navigator, and checking that the user agent matches the browser's actual behavior. And what will ruin the bot-writer's life even more: the links change every time, like the token. It's a lot of hassle, but I think after that they won't want to parse you.
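Points 8 and 9 above can be sketched as a short-lived, HMAC-signed token that binds the client fingerprint to an expiry time. This is my own minimal illustration, not the answerer's code: the secret, the 15-minute TTL, and the `|`-delimited payload (which assumes the fingerprint itself contains no `|`) are all placeholder choices.

```php
<?php
// Issue a token binding a fingerprint to an expiry, signed with HMAC-SHA256.
function issueToken(string $fingerprint, string $secret, int $now): string
{
    $expires = $now + 15 * 60; // token lives 15 minutes
    $payload = $fingerprint . '|' . $expires;
    $sig = hash_hmac('sha256', $payload, $secret);
    return base64_encode($payload . '|' . $sig);
}

// Accept the token only if the signature is intact, the fingerprint
// matches the one the client presents now, and the token has not expired.
function verifyToken(string $token, string $fingerprint, string $secret, int $now): bool
{
    $parts = explode('|', base64_decode($token));
    if (count($parts) !== 3) {
        return false;
    }
    [$fp, $expires, $sig] = $parts;
    $expected = hash_hmac('sha256', $fp . '|' . $expires, $secret);
    return hash_equals($expected, $sig) // constant-time signature check
        && $fp === $fingerprint
        && $now < (int) $expires;
}
```

A scraper rotating proxies would keep presenting the same fingerprint inside valid tokens, which is exactly the correlation point 8 relies on.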
But I'm in favor of giving the bot what it would otherwise find at your competitor's site, while serving it lots and lots of ads.
Good luck.
UPD:
Hook up sockets and watch user behavior: mouse-cursor movement, mouse mileage across the site, scrolling, clicks, and so on. When a click happens, is the cursor actually inside the element or not?
There is one more trick that can help, but I won't tell you; I'm afraid of it myself ))))
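Point 4 of the list (per-client limits plus a blocklist) could look roughly like this. The limit, the sliding window, and the in-memory arrays are placeholders; a real deployment would keep this state in Redis or APCu so it survives across requests.

```php
<?php
// Sliding-window rate limiter: allow at most $limit requests per client
// inside $window seconds; once exceeded, the client goes on the blocklist.
class RateLimiter
{
    private array $hits = [];   // client id => list of request timestamps
    private array $banned = []; // client id => true

    public function __construct(private int $limit, private int $window) {}

    public function allow(string $client, int $now): bool
    {
        if (isset($this->banned[$client])) {
            return false; // already blocklisted
        }
        // Keep only timestamps inside the current window.
        $this->hits[$client] = array_filter(
            $this->hits[$client] ?? [],
            fn ($t) => $t > $now - $this->window
        );
        $this->hits[$client][] = $now;
        if (count($this->hits[$client]) > $this->limit) {
            $this->banned[$client] = true; // blocklist, as the answer suggests
            return false;
        }
        return true;
    }
}
```

The `$client` key can be the IP, or better, the fingerprint from points 8-9, so proxy rotation does not reset the counter.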

Dmitry Belyaev, 2016-08-12
@bingo347 Curator of the JavaScript tag

If you take the trouble to add a way for third-party services to request the data over REST, and publish documentation for it on the site, the load from parsers goes down.
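A minimal sketch of that REST idea: serve the same table as documented JSON so well-behaved consumers stop scraping the HTML. The response shape here is an assumption, not anything from the original site.

```php
<?php
// Serialize the table rows as a stable, documented JSON payload.
// In a web context you would also send a JSON Content-Type header.
function restResponse(array $rows): string
{
    return json_encode(
        ['data' => $rows, 'count' => count($rows)],
        JSON_UNESCAPED_UNICODE
    );
}
```

Combined with the rate limits above, this splits traffic into "polite API users" and "scrapers worth banning".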

PooH63, 2016-08-12
@PooH63

As an option: cookie + CSRF. Until the keys match, don't fire the event responsible for the statistics.
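A minimal sketch of that cookie + CSRF check, assuming the token is issued with the page and echoed back by the AJAX click handler (the function names are mine):

```php
<?php
// Generate a random per-session token to set as a cookie / embed in the page.
function makeCsrfToken(): string
{
    return bin2hex(random_bytes(16)); // 32 hex characters
}

// Count the stats event only when the request echoes the cookie's token.
function csrfMatches(string $fromCookie, string $fromRequest): bool
{
    // hash_equals avoids timing leaks when comparing secrets
    return $fromCookie !== '' && hash_equals($fromCookie, $fromRequest);
}
```

As the question author already discovered, a determined bot can replay the token too, so this mainly filters direct `site.com/id/145/` hits out of the statistics.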
