Parsing
zhdoon, 2018-07-03 06:28:32

How to protect yourself from scrapers?

Bad actors periodically scrape the site.
Requests come from different IPs, so blocking addresses is not realistic.
Are there effective methods for dealing with scrapers?


9 answer(s)
Roman Kitaev, 2018-07-03
@deliro

No.

Philipp, 2018-07-03
@zoonman

There are several ways to fight this.
The server delivers the page with a constantly changing structure. For example, blocks are reordered, the CSS is generated on the fly, and the class names are chained randomly and are completely random. This may hurt SEO.
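A minimal sketch of the random-class-name idea, assuming a Python backend; render_price_block, product_name, and price are made-up names for illustration, not anything from the answer itself:

import secrets
import string

def random_class(length=8):
    # Generate a random, meaningless CSS class name for this response only
    alphabet = string.ascii_lowercase
    return ''.join(secrets.choice(alphabet) for _ in range(length))

def render_price_block(product_name, price):
    # Map logical roles ("name", "price") to throwaway class names,
    # so a scraper cannot rely on stable selectors between requests
    classes = {role: random_class() for role in ("wrap", "name", "price")}
    css = (f".{classes['wrap']} {{ display: flex; }}\n"
           f".{classes['name']} {{ font-weight: bold; }}\n"
           f".{classes['price']} {{ color: #222; }}")
    html = (f"<style>{css}</style>"
            f"<div class='{classes['wrap']}'>"
            f"<span class='{classes['name']}'>{product_name}</span>"
            f"<span class='{classes['price']}'>{price}</span>"
            f"</div>")
    return html
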
If that is not enough, the content is rendered with JS using similar tricks. The JS itself is also generated and obfuscated. The content is delivered through more complex channels, such as WebRTC DataChannel or WebSockets. SEO is then out of the question, and it works poorly on mobile.
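A minimal sketch of delivering content over a WebSocket instead of in the HTML, assuming the third-party websockets package (one-argument handler as in recent versions); the payload fields are made up:

import asyncio
import json
import websockets   # pip install websockets

async def send_content(websocket):
    # The page's obfuscated JS opens this socket and renders whatever
    # arrives; the HTML itself contains no readable content.
    payload = {"block": "description", "html": "<p>Actual content here</p>"}
    await websocket.send(json.dumps(payload))

async def main():
    async with websockets.serve(send_content, "localhost", 8765):
        await asyncio.Future()   # run forever

asyncio.run(main())
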
With this approach, a parser has to be written specifically for your site. Most likely it will take a screenshot and feed it to a recognizer.
Access to the information is granted only to a limited circle of people, for example clients. Access volumes are regulated, and exceeding them is punished by terminating the contract or by a fine.
Evercookie + fingerprinting is used to identify clients, along with a reputation score for addresses and subnets.
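Evercookie and browser fingerprinting live on the client side, but the server-side half of the idea, a rough per-client fingerprint keyed to the subnet so the reputation score follows the network rather than a single rotating IP, can be sketched like this (the choice of headers is an assumption):

import hashlib

def fingerprint(headers: dict, remote_addr: str) -> str:
    # Rough server-side fingerprint: hash headers that are stable for a
    # real browser but often missing or odd for scrapers, plus the /24
    # subnet of the client address.
    parts = [
        headers.get("User-Agent", ""),
        headers.get("Accept-Language", ""),
        headers.get("Accept-Encoding", ""),
        remote_addr.rsplit(".", 1)[0],
    ]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

print(fingerprint({"User-Agent": "Mozilla/5.0"}, "198.51.100.23"))
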
For untrusted subnets (IPs belonging mostly to various hosting providers), a captcha is shown immediately. The same goes for traffic from an unusual place, such as a sudden spike from India or China.
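A sketch of the "captcha for datacenter subnets" check using the standard ipaddress module; the CIDR ranges here are documentation placeholders, a real list would be built from published hosting-provider ASN data:

import ipaddress

# Example ranges only; a real list would come from hosting-provider
# ASN/CIDR data that you maintain yourself.
HOSTING_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),   # placeholder
    ipaddress.ip_network("198.51.100.0/24"),  # placeholder
]

def needs_captcha(client_ip: str) -> bool:
    # Challenge immediately when the request comes from a datacenter
    # subnet rather than a residential one.
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in HOSTING_NETWORKS)

print(needs_captcha("203.0.113.7"))   # True -> show captcha
print(needs_captcha("192.0.2.10"))    # False -> serve normally
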
Behavioral characteristics are analyzed with machine learning, and a reference model of normal users is built.
Everyone who does not fit the model lands on a slow server. The site starts serving content right away, but very slowly; a page can take 30 seconds to open, and an attempt at parallel requests results in an error. On a large site, such things stop web crawlers dead. On top of that, extra signals are checked, such as whether the client actually loaded the JS and CSS and moved the mouse here and there.
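A sketch of that slow-server ("tarpit") behaviour, assuming some web framework routes clients that fail the model into serve_suspicious; client_id and render_page are placeholders:

import time
import threading

active = {}            # client_id -> number of in-flight requests
lock = threading.Lock()

def serve_suspicious(client_id, render_page, delay=30):
    # Serve the page, but slowly, and fail a second concurrent request
    # from the same client instead of letting it parallelize.
    with lock:
        if active.get(client_id, 0) >= 1:
            raise RuntimeError("429: concurrent requests not allowed")
        active[client_id] = active.get(client_id, 0) + 1
    try:
        time.sleep(delay)          # the page "opens" for ~30 seconds
        return render_page()
    finally:
        with lock:
            active[client_id] -= 1
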
In addition to the methods above, there are very simple but effective ones. When parsing is detected, the parser is fed incorrect or deliberately distorted information. For example, if you suspect a competitor is stealing prices, you can serve prices slightly higher than the real ones and slightly alter the product name, for example by replacing the Latin letter "a" with the visually identical Cyrillic "а". Later you search for that marker with a search engine and the competitor's website turns up.
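A sketch of that homoglyph trick, swapping the Latin letter for the Cyrillic one so the marked text can later be found with a search engine:

def poison_text(text: str) -> str:
    # Replace the Latin letter "a" with the visually identical Cyrillic
    # "а" (U+0430). A human reader sees no difference, but the result is
    # a searchable fingerprint if it shows up on someone else's site.
    return text.replace("a", "\u0430")

original = "Samsung Galaxy charger"
marked = poison_text(original)
print(original == marked)   # False, although both render the same
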
After that, the issue is resolved in whatever way suits the business. Usually that means a copyright-infringement complaint. Or the competitor's warehouse suddenly burns down. To each their own.
And a note especially for those who like to scrape other people's sites: broken fingers make typing on a keyboard rather hard, so be careful; in most cases stealing content is not worth it.
Let's summarize. In most cases, parsing protection harms SEO.
If your content is being stolen, it means it is good. Protect it wisely. Simple measures such as copyright notices and publicized cases won against content thieves will keep them away from your site; just make those cases public. Monitor theft and report it to search engines.
Use technology to track theft, such as nonprinting characters and steganography in pictures.
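A sketch of the non-printing-character idea: encode the requesting client's id in zero-width characters, so a stolen copy of the text identifies who leaked it. The 16-bit payload size is arbitrary:

ZERO = "\u200b"   # zero-width space      -> bit 0
ONE  = "\u200c"   # zero-width non-joiner -> bit 1

def watermark(text: str, client_id: int, bits: int = 16) -> str:
    # Hide the client's id as zero-width characters after the first word;
    # invisible when rendered, but it survives copy-paste.
    payload = "".join(ONE if (client_id >> i) & 1 else ZERO
                      for i in range(bits))
    head, _, tail = text.partition(" ")
    return head + payload + (" " + tail if tail else "")

def extract(text: str, bits: int = 16) -> int:
    # Recover the id from a stolen copy of the text.
    hidden = [c for c in text if c in (ZERO, ONE)][:bits]
    return sum((1 << i) for i, c in enumerate(hidden) if c == ONE)

marked = watermark("Original product description here", client_id=4242)
print(extract(marked))   # 4242
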
Use internal links and anchors pointing to the content and its author, for example logical references to your previous work or to other products that can only be bought from you.
If your articles are being stolen, just demand a backlink.
If a product description is stolen, offer to sell it, and use the money to improve your own, increase turnover, or spend it on advertising.
One more recommendation: do everything you can so that search engines learn about your content before the thieves do.

Alexander Litvinenko, 2018-07-03
@edli007

There are various ways to make the task harder, but by complicating things for parsers you also make the site harder to read for search engines (they are parsers too).
In any case, the only way to be 100% protected from parsing is not to put the content on the Internet at all: if a browser can get it, any parser can pretend to be a browser.
It is easier to offer a paid API for the information. If your data is really in demand, many developers will simply be too lazy to write parsers and will suggest that their customers buy API access instead.
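A sketch of the paid-API idea, a per-key quota check in plain Python; the key, plan name, and limit are invented for illustration:

import time
from collections import defaultdict

API_KEYS = {"demo-key-123": {"plan": "basic", "limit_per_min": 60}}
calls = defaultdict(list)   # key -> timestamps of recent calls

def check_request(api_key: str) -> bool:
    # Reject unknown keys and keys over their paid quota; everyone else
    # gets the data without having to scrape it.
    plan = API_KEYS.get(api_key)
    if plan is None:
        return False
    now = time.time()
    calls[api_key] = [t for t in calls[api_key] if now - t < 60]
    if len(calls[api_key]) >= plan["limit_per_min"]:
        return False
    calls[api_key].append(now)
    return True

print(check_request("demo-key-123"))   # True
print(check_request("unknown-key"))    # False
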

Vyacheslav Grachunov, 2018-07-03
@Qwentor

No idea. Maybe a captcha like "select all pictures with cars" for overly frequent requests?

Evgen, 2018-07-03
@Verz1Lka

Use ready-made tools like Incapsula or Distil Networks.

Dimonchik, 2018-07-03
@dimonchik2013

1) Cloudflare
2) Captcha with fingerprints and logs; for (2) the backend needs a fairly resource-hungry service.

Dmitry, 2018-07-03
@php10

Given enough determination, any site can be parsed. So there is no way.

Dmitry Bashinsky, 2018-07-03
@BashkaMen

You can turn important content into a picture and generate it on the backend.
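A sketch of that render-to-image approach, assuming the Pillow library is available; note that OCR can still read the result, so this only raises the cost of scraping:

from PIL import Image, ImageDraw, ImageFont  # pip install Pillow

def text_to_image(text: str, path: str = "content.png"):
    # Draw the sensitive text into a PNG on the backend, so the HTML
    # contains only an <img> tag instead of machine-readable text.
    font = ImageFont.load_default()
    dummy = Image.new("RGB", (1, 1))
    bbox = ImageDraw.Draw(dummy).textbbox((0, 0), text, font=font)
    img = Image.new("RGB", (bbox[2] + 20, bbox[3] + 20), "white")
    ImageDraw.Draw(img).text((10, 10), text, fill="black", font=font)
    img.save(path)
    return path

text_to_image("+7 900 000-00-00")   # e.g. hide a phone number from scrapers
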

Alexander Semchenko, 2018-07-03
@0xcffaedfe

An easy way to protect against parsing is to make a public API!
