S
sas1024, 2013-06-28 18:41:15
linux

Detect an automatic parser among site visitors

So, basically, I've run into a problem: a bot has appeared among the site's visitors that scrapes all the data from the forum and posts it on a third-party resource, passing it off as its own. Quite unpleasant :/

That resource has not been promoted at all yet, but as I understand it, they want to use my site to fill it with content and then monetize it.

Can you tell me how to catch things like this? Maybe there is some log analyzer for Apache or nginx?
What is the best way to act in this situation?


10 answer(s)
T
TheMengzor, 2013-06-28
@sas1024

The bot will most likely not download images (or only the ones inside the content). So add an image at the bottom of the site with display:none and point it at a script that sets an isHuman=1 flag in the session. In the other scripts, check that flag: if it is empty, set it to 0; if it is already 0, don't serve the page at all. The idea: the script marks the visitor as human, while the robot gets to load a page only once, after which it can be blocked by its session or identified by other data and blocked.
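A minimal sketch of that idea, assuming a Flask-style app with cookie sessions; the /track/pixel.gif path and the is_human flag are illustrative names, not anything from the answer:

```python
# Hidden-image trick: a real browser fetches the invisible pixel and gets
# flagged as human; a scraper that never loads it is cut off after one page.
import base64
from flask import Flask, Response, abort, request, session

app = Flask(__name__)
app.secret_key = "change-me"

# 1x1 transparent GIF, served for the hidden <img style="display:none">.
PIXEL = base64.b64decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

@app.route("/track/pixel.gif")
def pixel():
    # A real browser that parsed the page requests this image; mark it human.
    session["is_human"] = 1
    return Response(PIXEL, mimetype="image/gif")

@app.before_request
def gate():
    if request.path.startswith(("/static/", "/track/")):
        return  # always let assets and the tracking pixel through
    flag = session.get("is_human")
    if flag is None:
        # First page view: give the visitor one chance to load the pixel.
        session["is_human"] = 0
    elif flag == 0:
        # Another page view without ever loading the pixel: likely a bot.
        abort(403)
```

Note that this only holds while the scraper keeps returning the session cookie; a cookie-less bot would have to be tied to an IP or other signals, as the answer suggests.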

L
lubezniy, 2013-06-28
@lubezniy

Let me offer an option, perhaps a silly one. The idea is to dynamically add timestamps to posts (you can hide them with style="display: none"), so you can tell the date and time of access, and to keep full logs (access.log). That lets you put together a claim for the other site's administration (the material was created at this time, scraped at this time, posted at this time) and study any stable, reliable signatures of the bot. If that administration ignores the request, and the hosting is far away and won't act on demands from another country, then after studying the technical side you could, for example, try to feed them (and only them!) prohibited content, and once it is auto-stolen, complain to Roskomnadzor that they are breaking the rules.
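A tiny sketch of the hidden-timestamp idea, assuming posts are rendered server-side; watermark_post and the span markup are made up for illustration:

```python
# Append an invisible access timestamp to each rendered post, so a stolen
# copy carries evidence of when it was fetched from the original site.
from datetime import datetime, timezone

def watermark_post(post_html: str) -> str:
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return post_html + f'<span style="display:none">fetched {stamp}</span>'
```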

W
Wott, 2013-06-29
@Wott

Any filtering can be bypassed, but there are signs that weed out the obvious bots:
1. Request headers: for some reason bot authors are too lazy to copy what a typical browser sends.
2. Request rate: bots are either too fast or too regular. Set a threshold of HTML requests per minute, or look at how variable the delays between requests are.
3. Downloading (or not downloading) assets: a regular browser fetches images, CSS and so on. There are subtleties - some browsers have started optimizing and no longer request invisible content - but CSS is always needed, so it makes a good "human" trigger.
4. Following links: put a link on the page that a user cannot click - a ready-made trigger for the bot (see the honeypot sketch after this answer). For reliability, randomize its position, class and parameters.
5. JavaScript: most bots don't execute it, though some users have it disabled. For example, fire a request on page load from JS as a conditional "human" trigger.
In general you build a filter that checks a bunch of these signs and decides from their combined weight that this is a bot; then either feed it garbage through the session, if you have sessions, or cut it off. If there are no sessions, add a rule in iptables/pf/ipfw (whatever you have there) for that IP for an hour, two, or a day.
A few words about the bots you do need - search engine spiders. You can pre-filter their IP addresses by user-agent, but unwanted bots will likely masquerade as them, so they need to be vetted before whitelisting.
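The honeypot link from point 4 might look roughly like this, assuming Flask and iptables are available; the /internal/archive path and the ban helper are invented for illustration:

```python
# Hidden-link honeypot: the page carries
# <a href="/internal/archive" style="display:none">, which a human never
# clicks but a naive crawler follows; whoever hits it gets banned.
import subprocess
from flask import Flask, abort, request

app = Flask(__name__)
BANNED_IPS = set()

def ban_ip(ip: str) -> None:
    # In-process blacklist as a fallback; the iptables rule (requires
    # privileges) stops further packets before they even reach the app.
    BANNED_IPS.add(ip)
    subprocess.run(["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"], check=False)

@app.before_request
def drop_banned():
    if request.remote_addr in BANNED_IPS:
        abort(403)

@app.route("/internal/archive")
def honeypot():
    ban_ip(request.remote_addr)
    abort(404)
```

Randomizing the href and class per page, as the answer recommends, makes it harder for the bot author to hard-code an exclusion.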

E
EugeneOZ, 2013-06-29
@EugeneOZ

The advice being given here makes me shudder; only a couple of people have mentioned the possible effect on search engines. So much money gets invested in SEO - don't close off access and don't complicate how content is delivered; neither users nor search engines will like it.
It's a shame, of course, that those idiots copy like that, but a human will get around any obstacle: they'll quickly start fetching the image too, and they'll mimic the JS-execution behaviour. For users, meanwhile, all of this means extra requests to the server, slower rendering and random blocking. And problems with search engines can hurt far more than resentment over some other site.

V
Vyacheslav Golovanov, 2013-06-28
@SLY_G

If you really want to bother with it, you could probably build into the forum a limit on the number of topics one user can view, for example (a rough sketch follows below)...
But by and large, what has made it onto the Internet belongs to everyone. I realized that long ago, and I advise you to as well: the value of a resource is not its uniqueness, but the fact that things appear on it first. You could just as well scrape articles from Habr and build another site out of them - what would be the point?
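For what it's worth, a per-visitor topic limit could be as crude as this sketch (Flask-style session; the threshold and counter name are arbitrary):

```python
# Cap how many topics a single session may open.
from flask import abort, session

TOPIC_LIMIT = 200  # arbitrary per-session cap on topics viewed

def count_topic_view() -> None:
    # Call this from the topic view handler before rendering.
    views = session.get("topics_viewed", 0) + 1
    session["topics_viewed"] = views
    if views > TOPIC_LIMIT:
        abort(429)  # too many topics for a single visitor
```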

I
isden, 2013-06-28
@isden

What about catching it by UA/IP - at the site-engine or .htaccess level?
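At the site-engine level that could be as simple as this sketch; the block lists and the Flask hook are illustrative, not from the answer:

```python
# Reject requests whose IP or User-Agent is on a blacklist.
from flask import Flask, abort, request

app = Flask(__name__)

BLOCKED_IPS = {"203.0.113.7"}                      # documentation-range example
BLOCKED_UA_PARTS = ("curl", "python-requests", "scrapy")

@app.before_request
def block_by_ua_ip():
    ua = (request.headers.get("User-Agent") or "").lower()
    if request.remote_addr in BLOCKED_IPS or any(p in ua for p in BLOCKED_UA_PARTS):
        abort(403)
```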

S
Sergey, 2013-06-28
@bondbig

The question here is whether it's worth the effort.
Fine, you identify the bot by its behaviour (say, it requests pages too often) - what do you do with it next? Ban it? A losing game: the bot operator will just change IP address and tactics.
If possible, put an unobtrusive copyright mark on images and the like, but that is also a half-measure, and users don't like it much.
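For reference, "requests pages too often" can be checked with a sliding window per IP; the thresholds and the in-memory store below are illustrative only:

```python
# Count recent page hits per IP and flag anything above the threshold.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_PAGES_PER_WINDOW = 30   # arbitrary threshold for HTML pages per minute
_hits: dict[str, deque] = defaultdict(deque)

def looks_like_a_bot(ip: str) -> bool:
    now = time.time()
    hits = _hits[ip]
    hits.append(now)
    # Drop hits that have fallen out of the sliding window.
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    return len(hits) > MAX_PAGES_PER_WINDOW
```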

N
Nikolai Turnaviotov, 2013-06-28
@foxmuldercp

A script to get around you can be written very quickly and simply - off the top of my head:
1) Tor or some other proxy servers to emulate connections from anywhere;
2) masquerading as search bots - you are unlikely to verify by IP address where a visitor really came from (though a reverse-DNS check, sketched below, can catch this); the Ukr.net portal, for example, was once also an ISP, and its dial-up users were handed addresses from its own IP block, so only with the logs in hand could you tell that a visitor was an end user and not a search crawler;
3) and changing the OS version and browser name reported by the client is a matter of minutes.
It's easier, as mentioned above, to contact the other site's administration.
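On point 2: major search engines can actually be verified with a reverse-then-forward DNS lookup (Google documents this approach for Googlebot); the suffix list below is illustrative:

```python
# Verify a visitor claiming to be a search crawler: reverse-resolve the IP,
# check the hostname suffix, then forward-resolve and compare with the IP.
import socket

TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".yandex.ru", ".yandex.net")

def is_genuine_search_bot(ip: str) -> bool:
    try:
        host = socket.gethostbyaddr(ip)[0]
        if not host.endswith(TRUSTED_SUFFIXES):
            return False
        # Forward resolution must point back to the same IP.
        return ip in socket.gethostbyname_ex(host)[2]
    except (socket.herror, socket.gaierror):
        return False
```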

D
demimurych, 2013-06-28
@demimurych

I'm terribly curious to have a look at the site.

E
Evgeny Elizarov, 2013-07-01
@KorP

Well, you could try ddos-guard.net. cURL can't execute JS, so this is a very easy way to break the parser, and rewriting it on top of something more serious is a thankless task.
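A very rough sketch of the kind of JS challenge such services rely on: a client that never executes the inline script (plain cURL, say) never gets the cookie and never sees the content. The cookie name and token scheme here are invented for illustration:

```python
# Serve a JS challenge page until the client presents the cookie that the
# inline script sets; cookie-less, JS-less clients stay stuck on the challenge.
from flask import Flask, make_response, request

app = Flask(__name__)
CHALLENGE_TOKEN = "42"  # a real implementation would sign and rotate this

CHALLENGE_PAGE = """<html><body>
<script>
  document.cookie = "js_ok=%s; path=/";
  location.reload();
</script>
</body></html>""" % CHALLENGE_TOKEN

@app.before_request
def js_challenge():
    if request.cookies.get("js_ok") != CHALLENGE_TOKEN:
        # No valid cookie yet: serve the challenge page instead of content.
        return make_response(CHALLENGE_PAGE, 503)
```

A production service would also sign and rotate the token and let legitimate search crawlers through, which a sketch like this does not address.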
