PHP
Roman Levkovich, 2016-01-10 14:40:31

Protecting a PHP site from scraping without harming search engines?

Good afternoon,
we are launching a PHP project, a database of legal entities, and we want to protect it against scraping.
We are considering these options:
Limiting the frequency of requests and the amount of downloaded data
Using a CAPTCHA
But we are afraid that search engine indexing will suffer.
What do you advise?

6 answers
Robot, 2016-01-10
@iam_not_a_robot

It's unrealistic))

Dmitry Entelis, 2016-01-10
@DmitriyEntelis

If there is content you plan to earn money on, the solution is to hide it under lock and key, behind payment.
Charge either for access to each unit of content, or for a subscription with a limit on how much content can be viewed per period, so that someone who buys 10-15-20 accounts still cannot scrape the whole database (see the sketch after this answer).
An example of such an implementation: www.nesprosta.ru/?type=show_home&id=65810 (the first link that came to mind).
If you have no intention of earning money on the content directly, then any attempt to protect yourself from scraping is doomed from the start. Any content you give to a search engine stays in its cache. Even if you limit access to the site (which is problematic in itself: there are a great many free proxies, not to mention the ability to spin up a hundred or two instances on Amazon at any moment), anyone who wants to scrape you can always scrape not your site directly but the search engine's cache ( webcache.googleusercontent.com/search?q=cache:YMI-... ).
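A minimal sketch of such a per-period quota, assuming a hypothetical content_views table and a PDO connection (table, column and function names are illustrative, not from the original answer):

<?php
// Sketch: deny access once an account has viewed more than $limit
// content items in the last 30 days. Table/column names are assumptions.
function canViewContent(PDO $pdo, int $userId, int $limit = 100): bool
{
    $stmt = $pdo->prepare(
        'SELECT COUNT(*) FROM content_views
         WHERE user_id = :uid AND viewed_at > NOW() - INTERVAL 30 DAY'
    );
    $stmt->execute(['uid' => $userId]);
    return (int) $stmt->fetchColumn() < $limit;
}

function recordView(PDO $pdo, int $userId, int $contentId): void
{
    $stmt = $pdo->prepare(
        'INSERT INTO content_views (user_id, content_id, viewed_at)
         VALUES (:uid, :cid, NOW())'
    );
    $stmt->execute(['uid' => $userId, 'cid' => $contentId]);
}

The point is that the limit is enforced server-side per paying account, so buying a handful of accounts does not give a scraper the whole database.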

Dmitry Bay, 2016-01-10
@kawabanga

There is only one truly reliable way to protect yourself from scrapers - do not show anything to anyone!
In practice, the options and the ways around them are:
1) CAPTCHA - as correctly noted, getting a captcha solved costs about 10 kopecks apiece. Cheap.
2) pauses between responses - real users can trip over this, and scrapers simply use proxies. Besides, is your database really that large? A scraper pausing 10 seconds between requests still gets 6 pages per minute and 360 per hour. A minimal rate-limit sketch follows this list.
3) ban by IP? Again, proxies get around it.
4) you can close access to unauthorized users, but then you lose indexing.
As a variant, I would split access into two levels: for authorized users and for everyone else. But even that can be bypassed by registering accounts with different email addresses.
In short, there is no purely technical way to deny access to strangers.
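A minimal sketch of a per-IP request limit, assuming the APCu extension is available (the limits and key prefix are illustrative):

<?php
// Sketch: allow at most $limit requests per IP in a $window-second window.
// Requires the APCu extension; limits and key prefix are illustrative.
function isRateLimited(string $ip, int $limit = 30, int $window = 60): bool
{
    $key = 'rl_' . $ip;
    apcu_add($key, 0, $window);   // create the counter with a TTL if missing
    $count = apcu_inc($key);      // atomically increment it
    return $count !== false && $count > $limit;
}

if (isRateLimited($_SERVER['REMOTE_ADDR'] ?? '')) {
    http_response_code(429);
    exit('Too many requests');
}

As the answer notes, this only raises the cost of scraping: a pool of proxies spreads the requests across many IPs and stays under the limit.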

Dimonchik, 2016-01-10
@dimonchik2013

develop legal means of protection;
technically, you can try to restrict access from non-Russian IPs (a sketch of such a filter is below), or maintain a database of known proxies (a resource-intensive procedure), but all of these are half measures against someone who is really determined to scrape you
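A minimal sketch of such a country filter, assuming the PECL geoip extension is installed (a MaxMind GeoIP2 database is a common alternative); the whitelist is illustrative:

<?php
// Sketch: block requests coming from outside a country whitelist.
// Assumes the PECL geoip extension; the whitelist is illustrative.
$allowedCountries = ['RU', 'BY', 'KZ'];

$ip = $_SERVER['REMOTE_ADDR'] ?? '';
$country = $ip !== '' ? geoip_country_code_by_name($ip) : false;

// Fail open when the country cannot be determined, so search bots and
// users behind unusual networks are not locked out by mistake.
if ($country !== false && !in_array($country, $allowedCountries, true)) {
    http_response_code(403);
    exit('Access restricted');
}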

Pretor DH, 2016-01-10
@PretorDH

You need to protect yourself at the level of application design on the client side, not only on the server.
FOR EXAMPLE: create a page for each client that contains no structured data, just a generic page with the client's name and a bunch of promotional teasers. The data itself is fetched by AJAX only after an explicit click on a button/image (not an A tag), and the request URL is generated dynamically by JS from, say, a one-time token, the client id and the record id (not rendered by PHP). No explicit links on the page or in the JS code! A sketch of such a one-time token check is shown after this answer.
In addition, block repeated requests from the same address, and set a limit of N requests per guest per day/week/month/year.
I think that will be enough for you. Such a page will still be indexed at the level of company names, and it cannot be scraped with an off-the-shelf tool - someone would have to write a script specifically for your site.
The hardest variant you can come up with is to generate the data-handling JS dynamically on the server side. That complicates the work by an order of magnitude.
But!
THERE IS NO SILVER BULLET!
You have to build multi-layered protection. If everything is done with a sufficient level of complexity, a low-skill coder will not be able to get around it, and a self-respecting professional will not bother trying.
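A minimal sketch of the one-time token idea using PHP sessions (the function names and the endpoint wiring are illustrative, not from the original answer):

<?php
// Sketch of a one-time AJAX token: the page embeds the token in its JS,
// the AJAX endpoint validates it, burns it, and only then returns data.
// Function names and the endpoint wiring are illustrative.
session_start();

function issueAjaxToken(): string
{
    $token = bin2hex(random_bytes(16));
    $_SESSION['ajax_token'] = $token;   // one live token per session
    return $token;                      // embed this value in the page's JS
}

function consumeAjaxToken(string $token): bool
{
    $valid = isset($_SESSION['ajax_token'])
        && hash_equals($_SESSION['ajax_token'], $token);
    unset($_SESSION['ajax_token']);     // one-time: invalidate it either way
    return $valid;
}

// In the AJAX endpoint:
// if (!consumeAjaxToken($_POST['token'] ?? '')) { http_response_code(403); exit; }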

mirosas, 2016-01-10
@mirosas

It depends on two parameters:
1. how important it is to you that your information is not scraped;
2. how badly those who want your information need it.
If 2 is greater than 1, you lose.
For educational purposes, I advise you to successfully bypass some existing anti-scraping system yourself, and then build the same kind of system without the weak points you found.
Search robots are a separate issue. Yes. They usually come from fixed IP ranges with identifiable hostnames, so they can be recognized and whitelisted (a sketch is below).
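A minimal sketch of recognizing Googlebot by the documented reverse-plus-forward DNS check (Yandex robots can be verified the same way against yandex.ru / yandex.net / yandex.com hostnames); the function name is illustrative:

<?php
// Sketch: verify that an IP claiming to be Googlebot really belongs to
// Google, using reverse DNS plus a confirming forward lookup.
function isGooglebot(string $ip): bool
{
    $host = gethostbyaddr($ip);          // reverse DNS lookup
    if ($host === false || $host === $ip) {
        return false;                    // no PTR record
    }
    $isGoogleHost = (bool) preg_match('/\.(googlebot\.com|google\.com)$/i', $host);
    // Forward-confirm: the hostname must resolve back to the same IP.
    return $isGoogleHost && gethostbyname($host) === $ip;
}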
Limiting the frequency of requests and the amount of downloaded data
- proxies solve that problem.
Using CAPTCHA
- you will be torturing ordinary users with captchas, while, if I'm not mistaken, getting a captcha solved costs about 10 kopecks apiece.
In general, the methods you suggest will protect data whose scraping is worth up to about $150.
