Parser bot in Python: how to optimize?
Hey!
I am writing parsers in Python + Selenium + chromedriver to automatically order goods on sites like mvideo, citilink, etc.
I've run into the following problems:
1) sites stop loading once the parser has placed a fair number of orders; perhaps the parser gets banned automatically by services like botscanner
2) it is very difficult to write one universal parser for several sites
3) the slightest change in page markup forces me to edit the parser (renamed tags, the target element moved to a new block, etc.)
My questions:
1) Are there ways to make a more universal parser?
2) How can I get around the artificial slowdowns that sites impose on my parser? Proxies don't help. This is the biggest problem: after about 50 orders (not in a row, but one or two a day), some sites just take a very long time to load.
1) It all comes down to the art (and it is an art) of writing XPath expressions. An XPath that is merely valid for one specific document is one thing; an XPath that is insensitive, within limits, to page layout changes is quite another.
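To illustrate the difference, here is a minimal sketch. The two page snippets, the `itemprop` attribute, and the helper `find_text` are made up for the example, and it uses the limited XPath subset of the standard library's `xml.etree.ElementTree` rather than a real browser. The same idea should carry over to expressions passed to Selenium: anchor on a stable attribute instead of on the element's position in the tree.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same product page, before and after a redesign.
# They are written as well-formed XML so that the stdlib parser accepts them.
OLD_PAGE = """
<html><body>
  <div class="header">Shop</div>
  <div class="product">
    <span class="title">Phone</span>
    <span itemprop="price">499</span>
  </div>
</body></html>
"""

NEW_PAGE = """
<html><body>
  <div class="banner">Sale!</div>
  <div class="header">Shop</div>
  <section class="product-card">
    <div class="info">
      <span class="title">Phone</span>
      <span itemprop="price">499</span>
    </div>
  </section>
</body></html>
"""

# Valid for OLD_PAGE only: depends on the exact nesting and order of elements.
BRITTLE = "./body/div[2]/span[2]"
# Depends only on an attribute that describes what the element means.
ROBUST = ".//span[@itemprop='price']"


def find_text(page, xpath):
    """Return the text of the first element matching xpath, or None if nothing matches."""
    root = ET.fromstring(page.strip())
    node = root.find(xpath)
    return None if node is None else node.text


if __name__ == "__main__":
    for name, page in (("old", OLD_PAGE), ("new", NEW_PAGE)):
        print(name, "brittle:", find_text(page, BRITTLE), "robust:", find_text(page, ROBUST))
```

The positional expression stops matching as soon as a banner is added and the product block is wrapped in new containers, while the attribute-based one keeps working on both versions.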
2) Keep logs. Record all headers received from the server, response codes, timestamps, and the returned pages themselves. Then analyze the collected logs and look for patterns. Examine closely the moment when the server stopped giving normal responses and started complaining. What changed at that moment? How many requests had been sent before it a) in that particular session; b) with that specific User-Agent; c) from that specific IP; d) in the previous minute/hour/day? Is it a round number, such as 100/1000/1000000? From this, draw your own conclusions about the formal criteria the server uses to ban clients.
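The analysis described above could look roughly like this. It is only a sketch: the `LogRecord` fields, the 10-second "too slow" threshold, and the function names are assumptions for the example, and how the records get captured from the browser is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogRecord:
    """One logged request; the field set is an assumption for this sketch."""
    timestamp: datetime
    session_id: str
    user_agent: str
    ip: str
    status: int        # HTTP response code
    elapsed_s: float   # how long the page took to load, in seconds


def first_failure(records, slow_after_s=10.0):
    """Index of the first record that looks throttled (non-2xx or too slow), else None."""
    for i, rec in enumerate(records):
        if not (200 <= rec.status < 300) or rec.elapsed_s > slow_after_s:
            return i
    return None


def counts_before(records, index):
    """Count earlier requests sharing a session/User-Agent/IP or a time window with records[index]."""
    bad = records[index]
    earlier = records[:index]

    def within(delta):
        return sum(1 for rec in earlier if bad.timestamp - rec.timestamp <= delta)

    return {
        "same_session": sum(1 for rec in earlier if rec.session_id == bad.session_id),
        "same_user_agent": sum(1 for rec in earlier if rec.user_agent == bad.user_agent),
        "same_ip": sum(1 for rec in earlier if rec.ip == bad.ip),
        "last_minute": within(timedelta(minutes=1)),
        "last_hour": within(timedelta(hours=1)),
        "last_day": within(timedelta(days=1)),
    }
```

Calling `first_failure` on the records sorted by time and then `counts_before` at that index gives the numbers to compare across several bans. If one of them keeps landing on the same value, that is a candidate for the server's limit.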