M
makkartnis, 2014-05-03 21:23:39
PHP

How can I protect a site against parsing (scraping)?

I got interested in this question at some point. Summarizing the information I have collected, plus a little personal experience: whoever really needs the data will get it anyway, but it is still possible to make their life harder. I work as a PHP + JS web developer in an office, and I have had to write several custom parsers myself.
The following questions are of interest:
First: is there software that can scrape content which is generated dynamically, i.e. that requires JS execution? I don't just mean AJAX, but the case where the link to the required content is produced by a JS function that gets swapped out regularly.
Second: the key techniques for preventing automatic copying that seemed useful to me are the following:
1. The same dynamic content mentioned above.
2. Dynamically changing the layout (I have heard something about search engines penalizing this).
3. Blocking by IP if the visitor is not a search bot (a rough verification sketch follows after this list).
Here I would like to hear your methods, ideas, and the possible problems associated with them.
I forgot to add a 4th point: serve one version of the content to search engines and another to regular visitors.
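For point 3, a rough sketch of how such a check could look in PHP (the function name and the list of allowed hostname suffixes are my own assumptions; the reverse-plus-forward DNS confirmation itself is what the search engines document for verifying their bots, rather than trusting USER_AGENT):

<?php
// Does the client IP really belong to a known search engine?
// Step 1: reverse DNS; step 2: forward DNS to make sure the name is not spoofed.
function isSearchBot($ip)
{
    $host = gethostbyaddr($ip); // e.g. crawl-66-249-66-1.googlebot.com
    if ($host === false || $host === $ip) {
        return false; // no PTR record at all
    }

    $allowedSuffixes = array('.googlebot.com', '.google.com', '.yandex.ru', '.yandex.net', '.yandex.com');
    $looksLikeBot = false;
    foreach ($allowedSuffixes as $suffix) {
        if (substr($host, -strlen($suffix)) === $suffix) {
            $looksLikeBot = true;
            break;
        }
    }
    if (!$looksLikeBot) {
        return false;
    }

    // Forward confirmation: the hostname must resolve back to the same IP.
    $ips = gethostbynamel($host);
    return is_array($ips) && in_array($ip, $ips, true);
}

// Usage: only apply blocking / rate limiting to visitors that are not verified bots.
if (!isSearchBot($_SERVER['REMOTE_ADDR'])) {
    // count requests, show a captcha, block, etc.
}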


12 answers
S
starosta6123, 2014-05-03
@starosta6123

1. The site was originally intended for publication, that is, it is open.
2. If you do not want the information to be open, do not publish.
From point 1 it follows that there are no sufficient means to protect against parsers.
The only question is how much you are ready and can complicate life for the parsers.
Is it necessary? Maybe you are "the elusive Joe" ?
Everything that a person can read and recognize (and the site is made for people, after all) can be reproduced. Wherever parsing can be automated, it will be automated.
These days Yandex and Google have powerful parsers. If they cannot parse your site, it will not be in their index either, which means the useful information will not reach the end user.
And whoever wants the information will copy it, if it is really needed. Even if you present it as a mosaic of pictures and fragments, even if you encrypt it, the information still has to be readable on screen, which means a simple screenshot plus recognition in FineReader will be done faster than you can write protection against it...
Give this undertaking up!
There is no human-made protection that cannot be broken, it's just a matter of time...
The only real way is encryption, with a key issued to the client. But the client is a person, people are unreliable, and the information will leak anyway; it is only a question of price!
So once again: drop it!
I also thought about this once and got nowhere. Any protection complicates the system and increases the number of errors. Users will leave your site sooner simply because, due to an error in some script, they did not get the data they wanted.
Last tip: drop it!
The only thing that can really help is not disclosing all the information about the subject, or splitting it into several parts, as long as this does not inconvenience the visitor. For example, hide "the number of teeth in the gear", some key piece of information without which "the plane won't take off".
And if you want to play around, here is an idea that came to mind: shuffle the text according to some algorithm and then restore it, applying styles that hide "fake" words or phrases. For example, set a style that hides every second sentence or word. Unfortunately, this breaks easily too! But at least it will bring the crackers some joy :-)
Sorry for such a big mess!
1. Dynamic requests. Well, they will give the cracker a bit of a headache, but it is not as difficult as it seems.
2. Layout. I do not know about a ban from search engines, but this can also be broken: just strip the tags and that's it, a "smart" filter simply gets added to the parser. Of course, you can replace a picture somewhere with a background, or part of the text with an image, but a parser can be written for that too.
3. Blocking by IP will not work, since real people can suffer, and it is enough for the parser to use dynamic IPs.
In general, if you want to protect yourself from simple parsers, a set of measures can help. Another idea: parsers are usually very active, so you can spot them by the number of requests from one IP, by USER_AGENT and other markers, by the absence of JavaScript, by whether they process a delayed <META> redirect (redirekt.info/article/redirekt-na-html-s-zaderzhko...), and by other signs. You can also serve a hidden image (style="display: none"); most parsers will not fetch it (depending on their settings).
In general, you can frame the task differently: "setting traps for parsers", that is, catching them on things that ordinary people and browsers do not do. For example, filling in a hidden "password" field. Successful traps will let you identify the fakes, but it is better to run several checks, otherwise you may ban a real user. And I would not ban them at all; I would leak slightly or partially altered information instead. That data can become a marker for identifying whoever is really trying to siphon content off you.
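A minimal sketch of such a trap in PHP (the field name and the reaction are made up for illustration): the form contains a field hidden with CSS that a human never fills in, and the handler checks it.

<?php
// Honeypot sketch. The form includes a field that is hidden from real users:
//   <input type="text" name="website_url" value="" class="hp-field">
// where .hp-field is display:none in the stylesheet. Humans leave it empty;
// naive bots happily fill every field they see.
if ($_SERVER['REQUEST_METHOD'] === 'POST' && !empty($_POST['website_url'])) {
    // The trap fired: log it and, as suggested above, do not ban outright -
    // mark the session so it can be served slightly altered content instead.
    error_log('Honeypot triggered from ' . $_SERVER['REMOTE_ADDR']);
    $_SESSION['suspected_parser'] = true; // assumes session_start() was called earlier
}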
Everyone, good luck!

Y
Yuri Tseretyan, 2014-05-04
@Karasb

I have written quite a few different web parsers and automations of varying complexity, and I can say that the only option is not to publish information at all. I think the following will help discourage the desire to parse the site or at least increase the cost of development / support of the parser:
1. A system for monitoring user behavior (mouse movement, button click coordinates, etc.) in order to detect bots.
2. Do not use id, name, or other attributes by which the content can be located.
3. Obfuscate CSS and make the class names dynamic (a sketch of this follows at the end of this answer).
4. Dynamically add various garbage to the markup.
5. Use a web framework, and don't expose methods to the outside.
6. Use captchas from different vendors, with dynamically generated URLs, and load them in such a way that they cannot be pulled out of the browser cache (this will not save you from request interception, but it will spoil the life of automators).
7. Periodically change the layout.
I would not recommend loading content via Ajax: intercepting the request from the browser is not a big problem, and the area the parser has to search through immediately narrows.
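Regarding point 3, a minimal sketch of per-session class-name obfuscation in PHP (the function name, the salt scheme, and the class names are my own assumptions, not something prescribed in this answer):

<?php
// Templates refer to logical class names; what actually gets emitted is a
// salted hash that changes every session, so selectors hard-coded in a parser
// stop matching. The same function has to be used when the CSS is generated,
// otherwise the styles will not apply.
session_start();

if (empty($_SESSION['css_salt'])) {
    $_SESSION['css_salt'] = md5(uniqid(mt_rand(), true));
}

function obfuscatedClass($logicalName)
{
    return 'c' . substr(md5($logicalName . $_SESSION['css_salt']), 0, 8);
}

// In a template:
echo '<span class="' . obfuscatedClass('price') . '">1 990</span>';
// In the dynamically generated stylesheet:
echo '.' . obfuscatedClass('price') . ' { font-weight: bold; }';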

P
Puma Thailand, 2014-05-04
@opium

Whoever really needs it will parse it anyway, and that says it all.

Y
Yuri Morozov, 2014-05-04
@metamorph

First: yes. You can also take screenshots at the same time.
By the way, SEOs will kill you for dynamic links.
Second:
1. no
2. potential ban/pessimization
3. won't work, since parsing has long been done through a huge list of proxies
4. pure cloaking, ban immediately
The problem could be partially solved if there were a way to reliably identify the crawler. But, unfortunately, robots can visit the site "undercover" for their own internal needs, and then the whole scheme falls apart.

N
Nazar Mokrinsky, 2014-05-03
@nazarpc

AJAX requests are just requests. Using cURL you can generate a request that is absolutely identical to a request from any browser, and there is nothing you can do about that.
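As a rough illustration (the URL and the header values here are placeholders copied from a typical browser of that era, nothing specific):

<?php
// A cURL request dressed up as an ordinary browser request: same user agent,
// same Accept headers, cookies persisted between requests. From the server's
// point of view there is very little left to distinguish it from a browser.
$ch = curl_init('https://example.com/page');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_COOKIEJAR      => '/tmp/cookies.txt', // keep cookies like a browser would
    CURLOPT_COOKIEFILE     => '/tmp/cookies.txt',
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36',
    CURLOPT_HTTPHEADER     => array(
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: en-US,en;q=0.8',
        'Referer: https://example.com/',
    ),
));
$html = curl_exec($ch);
curl_close($ch);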
You can try to analyze visitor behavior, but that is not a rewarding business.
The easiest option is to turn on the corresponding option in the CloudFlare settings (if you use it), because writing such a thing yourself is a very thankless task.

P
Pavel Volintsev, 2014-08-23
@copist

I used the following methods to protect small fragments of text:
1. Generating text as images - usually email addresses were hidden this way, but you can render anything. You can apply watermarks, use a multi-colored background, and best of all insert arbitrary characters in arbitrary places in the same color as the main text - when recognized, the result will be garbage.
2. Inserting junk tags with dynamic random styles into the text

<span class="ADsdas POxlka3">note</span>
<span class="GHJbk KLJHK">x862</span>
<span class="j38jdJ Uu300D">book</p>

In this case, the text on screen reads as "notebook", but if it is copied through the clipboard you get "notex862book" (the middle span is hidden by the randomly named styles).
The noise must be pseudo-random in a reproducible way, that is, it must not depend on the time, the weather, or a random number generator; it must depend on the text itself. Otherwise the clean text could be restored by repeatedly regenerating the picture or the "noised" text and comparing the results.
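A minimal sketch of what such deterministic noise could look like (the function, the class name "nx", and the seeding scheme are my own assumptions, not the author's code):

<?php
// Junk <span>s whose content and positions are derived from a hash of the
// protected text plus a site secret, so every page load produces identical
// noise and the clean text cannot be recovered by diffing several copies.
// The class "nx" must be styled as invisible in the (obfuscated) CSS.
function addNoise($text, $secret)
{
    $out = array();
    foreach (explode(' ', $text) as $i => $word) {
        $piece = htmlspecialchars($word);
        $h = md5($secret . '|' . $text . '|' . $i); // depends only on the text, not on time or rand()
        if (hexdec(substr($h, 0, 2)) % 2 === 0) {
            // junk glued directly to the word: invisible on screen,
            // but it ends up in the clipboard when the text is copied
            $piece .= '<span class="nx">' . substr($h, 2, 4) . '</span>';
        }
        $out[] = $piece;
    }
    return implode(' ', $out);
}

echo addNoise('some protected text', 'site-secret');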
Both ways lead to a performance drawdown.

S
svd71, 2014-05-03
@svd71

If you publish something on a worldwide network and then want to prohibit copying it, that is utter nonsense.
I use watermarks on pictures and the simplest protection against select-copy-paste (protection from amateurs; for a robot it is not an obstacle).
I look through the logs periodically, and the most active "searchers" sooner or later get redirected to the Microsoft website. This discouraged a few who were zealously probing for things like phpMyAdmin, which I do not even use. I am thinking about how to ban them with iptables.

L
Leshrac, 2014-05-04
@Leshrac

JS execution and dynamic content are not an obstacle: that is exactly what headless browsers are for (PhantomJS/CasperJS/etc.).
It is also convenient to parse using the IE + AutoIt bundle.
If someone really needs the information from a site, they will parse it patiently: a month, two, six months. So, as advised above, either drop the idea or approach each case individually.

E
Eugene, 2014-05-04
@evgeniy_p

Quite a while ago Yandex introduced a useful feature: confirmation of content authorship. But for this, the resource needs a TIC of at least 10.
For Google, it seems you need to add a link to your Google Plus account and the article is automatically considered yours. The site will even be shown in the results with your avatar from that social network.
If you constantly serve dynamic markup, for example keep changing the names of classes and divs, search engines will not like it much. Google has a handy feature for marking up a page: you indicate where the text is, where the title is, where the preview is, and it then parses your site properly on its own. With dynamic layout you can forget about that.
Conclusion: protection against parsing is a dubious undertaking; the SEO people will hate you for it.

M
Maxim Uglov, 2014-05-03
@Vencendor

1. Load content via Ajax with token generation. But search engines may get mad.
2. The layout should be not just dynamic but different for each entry, so that search engines are kinder to it. For example:

switch ($news['id'] % 10) {
     case 0 : echo "<div>".$news['content']."</div>"; break;
     case 1 : echo "<p>".$news['content']."</p>"; break;
     case 2 : echo "<span>".$news['content']."</span>"; break;

// ... cases 3-9: other wrapper tags ...

}

For the most part, this is a trade-off between staying friends with the search engines and keeping the information safe. It is better to simply report your content to Yandex right away via addurl; there is also a feature for submitting new texts. After that you can complain about a thief. They say it helps: the thief's sites get pessimized.

Z
zooks, 2014-05-04
@zooks

1. Put information in a closed section.
2. Regularly add new articles, send texts to search engines.
3. Add JavaScript copy protection aimed at casual users ("housewives").
4. Profit.
Anything beyond that is just piling up crutches and simply makes no sense.

F
FullstackWEB, 2018-01-17
@FullstackWEB

Damn, let your imagination run wild! Find all the differences between a browser and "dead" libraries (I attach my own meaning to the word "dead" here, do not mind it: these are the libraries that simply fetch the content at a URL as text and nothing more, without imitating any browser behavior and without doing what I am about to describe. Although that can be imitated too, but let someone first try to figure out whether such protection even exists on the site :) ).
Browsers cache images, and a PHP script can be disguised as an image.
For example, you request site.ru/image/logo.png, and through the web server's rules (in Apache, a RewriteRule) the URL of that image actually returns the output of a PHP script (which in turn streams a real image through itself). NO ONE will notice this. Meanwhile, the PHP script records in the database the fact that the image was loaded.
As a result: if the image is downloaded again on the next request to the server, it is not being cached by the browser, which most likely means the site is being parsed. The only pitfall is that a parser is unlikely to fetch images at all and will only take the page, but the check basically still works: if the image was not loaded after the first page view, then on subsequent requests you can already take preventive action (so why didn't you load the picture, are you not a browser?).
Search engines can be recognized by their user agent, because people who write parsers disguise them as browsers (out of narrow-mindedness... well, or for greater reliability) and use browser user agents, not search engine ones. There are also users who turn images off, although that would stop them from using the site normally anyway.
But as I wrote, use your imagination. Also compare the differences between browsers and see what you have to work with. I have described only one example; it may not be perfect, but it is just an example. The main point is for you to find your own method based on it. There is a lot more you can think of.
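A rough sketch of this trick, assuming Apache and a made-up file layout (the rewrite rule, file names, and the session flag are my own reconstruction, not the author's actual code):

<?php
// tracker.php - served instead of /image/logo.png via a rewrite rule such as:
//   RewriteRule ^image/logo\.png$ tracker.php [L]
// It records who actually fetched the image and then streams the real file,
// so a browser sees nothing unusual. In practice you would log IP and
// user agent to a database, as the answer suggests; the session flag here
// is just the simplest illustration.
session_start();
$_SESSION['loaded_logo'] = time();

header('Content-Type: image/png');
header('Cache-Control: public, max-age=86400'); // let real browsers cache it
readfile(__DIR__ . '/static/logo-real.png');    // hypothetical path to the real image

The page-generating code can then check whether that flag (or the database record for the IP) was ever set: a client that keeps requesting HTML but never fetched the image once is probably not a browser.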
