reCAPTCHA
Pogran, 2017-10-15 23:33:09

How to detect a parser?

How can you reliably and quickly detect that a site is being scraped by a parser?


7 answer(s)
Dimonchik, 2019-07-14
@dimonchik2013

Not enough code.
If the error is still relevant, check Chrome DevTools (F12).

Lander, 2017-10-16
@usdglander

I had one such story. I needed to parse photos from a large site. The site was actively monitored (that is, the admins/programmers were on the job), so plain curl GET requests got blocked by IP after 100-200 requests. Here is our struggle as a list.
1. They block frequent requests, so I add a delay. The speed drops significantly, while the number of requests I get through barely increases. I build a list of proxies, put them into the script, and switch proxies whenever a Gateway Timeout comes back.
2. They block all the proxies I was using (and I could not find new ones). I analyze the file names and work out how they are generated (I already had the parsed IDs and names), so I drop the HTML requests and download only the pictures (each entity had several photos, a different number each time). As soon as a 404 comes back, I move on to the next entity.
3. IP blocking continues, but now it is tolerable: I manage to download about 1000 photos per session, then change the IP on the router and carry on.
4. They remove ALL blocks and return an image for every request (it would seem everything is fine), but after a while they start returning a VERY heavily distorted image to that IP (black and white + noise + twisted into a spiral).
Of course, I did not go on to write an image analyzer... :)
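
Seen from the site's side, step 2 above leaves a recognizable trace: one client requesting many guessed file names and regularly running into 404s. The following is only a minimal sketch of how that could be flagged, not what the site in the story actually did. The function name `createEnumerationDetector` and the thresholds `minRequests` and `max404Ratio` are assumptions made up for illustration.

```javascript
// Hypothetical detector: flags clients whose share of 404 responses
// suggests they are guessing file names rather than following links.
function createEnumerationDetector({ minRequests = 50, max404Ratio = 0.2 } = {}) {
  const stats = new Map(); // ip -> { total, notFound }

  return {
    // Call once per served request with the client IP and response status.
    record(ip, statusCode) {
      const s = stats.get(ip) || { total: 0, notFound: 0 };
      s.total += 1;
      if (statusCode === 404) s.notFound += 1;
      stats.set(ip, s);
    },
    // True once the client has made enough requests and too many were 404s.
    isSuspicious(ip) {
      const s = stats.get(ip);
      if (!s || s.total < minRequests) return false;
      return s.notFound / s.total > max404Ratio;
    },
  };
}
```

A scraper that rotates IPs, as in step 3, spreads its requests across several entries, so a check like this probably only raises the cost of scraping rather than stopping it.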

Vyacheslav Uspensky, 2017-10-15
@Kwisatz

1. Why?
2. You can't. If someone needs your content, it will be parsed. If you start blocking, there are proxies, browser emulation, and other tricks.

Alex-1917, 2017-10-16
@alex-1917

As you know, the best defense is an attack, so attack these evil Pinocchios with broken site markup and a buggy, freezing server, and the parser robots will go crazy)))
But seriously, this is the eternal lock versus master key problem; it all comes down to the value of your pictures weighed against the cost of fighting for them.
It reminds me of a conversation between two typical hapless businessmen sitting in a Mercedes 500 bought on credit, wearing torn shorts instead of trousers:

- What's your turnover, Vasya?
- 10 million a month, not a care in the world, Fed!
- I'm doing great too, Vasya!
- Haven't you heard what profit is, Fed? It was on TV yesterday!
- No, Vasya, I haven't! Forget it! Turnover!!!
- Yeah, Fed!

zim32, 2017-10-15
@zim32

You can attach a JavaScript handler such as $(window).on('mousemove') and see whether the user moves the mouse )
The problem is that by then the content has already been sent.
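
A minimal sketch of the mouse-movement idea, under a few assumptions: it uses the standard `addEventListener` instead of jQuery so that it stays self-contained, and the helper name `trackMouseActivity` and the shape of the returned object are made up for illustration.

```javascript
// Hypothetical helper: records whether any mousemove event was seen on `target`.
// In a browser, `target` would typically be `window`.
function trackMouseActivity(target) {
  const state = { moved: false, firstMoveAt: null };

  function onMove() {
    state.moved = true;
    state.firstMoveAt = Date.now();
    // One event is enough for this signal, so stop listening.
    target.removeEventListener('mousemove', onMove);
  }

  target.addEventListener('mousemove', onMove);
  return state;
}

// Browser usage (assumption): const activity = trackMouseActivity(window);
// Later, activity.moved could be sent along with the next request.
```

This is a weak signal at best: touch devices produce no mousemove events, and a scraper driving a real browser can synthesize them.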

Dimonchik, 2017-10-16
@dimonchik2013

With fail2ban.
There are also fairly simple protection methods;
for example, there are not that many HTTPS proxies, etc.
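
fail2ban itself works by scanning log files and banning offending IPs at the firewall, driven by its `maxretry`, `findtime`, and `bantime` settings. The sketch below is not fail2ban; it only imitates that counting logic in application code to show the idea. The function name `createRateBan` and the default thresholds are assumptions chosen for illustration.

```javascript
// Hypothetical in-memory limiter in the spirit of fail2ban's
// maxretry / findtime / bantime settings (not fail2ban itself).
function createRateBan({ maxHits = 100, findTimeMs = 60000, banTimeMs = 600000 } = {}) {
  const hits = new Map();        // ip -> timestamps of recent requests
  const bannedUntil = new Map(); // ip -> time when the ban expires

  // Returns 'banned' or 'ok' for a request from `ip` at time `now` (ms).
  return function check(ip, now = Date.now()) {
    if ((bannedUntil.get(ip) || 0) > now) return 'banned';

    // Keep only hits inside the sliding window, then add this one.
    const recent = (hits.get(ip) || []).filter((t) => now - t < findTimeMs);
    recent.push(now);
    hits.set(ip, recent);

    if (recent.length > maxHits) {
      bannedUntil.set(ip, now + banTimeMs);
      hits.delete(ip);
      return 'banned';
    }
    return 'ok';
  };
}
```

For example, with `createRateBan({ maxHits: 3, findTimeMs: 1000, banTimeMs: 5000 })`, a fourth request from the same IP within one second is answered with 'banned', and the IP stays banned for the next five seconds.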

Evgen, 2017-10-16
@Verz1Lka

Nothing better than Distil Networks has been invented yet.
