Python
Bjornie, 2017-10-04 16:13:39

What is the best and fastest way to parse Amazon in Python?

I'm writing an Amazon product parser. It only needs to handle static HTML pages, so nothing has to be loaded through AJAX, let alone rendered dynamically (for example, with Selenium). I'm interested in a few text fields on each page (price, shipping, etc.). Because there are a lot of products and Amazon has strong protection against scraping, I'd like to know which libraries to choose to build a reliable parser that works through proxies and does so quickly.
I have already written part of it with BeautifulSoup (lxml) + requests (with a proxy list) + a random User-Agent, but it doesn't feel very fast. Should I look at other libraries? Should I use Scrapy or something else for this? Let me know if anyone has had similar experience.
Or, if I stay with this stack, which language features would you recommend paying attention to in order to speed the parser up?

3 answer(s)
polarlord, 2017-10-05
@Bjornie

I parse Amazon on an industrial scale (hundreds of thousands of pages a day). The biggest problem is not the libraries: Amazon is very good at detecting scraping attempts and keeps improving its detection techniques. The most effective approach is therefore to have a decent set of high-quality proxies at your disposal. Proxies that differ only in the last octet and port number won't last long: they get blacklisted for anywhere from an hour to a day, depending on how intensively you send requests through them.
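To illustrate the point about proxies, here is a minimal sketch of a pool that spreads requests across subnets and rests a proxy once it looks blocked. Everything in it is an assumption made for illustration: the `ProxyPool` class, the `subnet_key` helper, the `ip:port` proxy format, and the one-hour default cooldown (taken from the lower end of the blacklist period mentioned above). How you decide that a proxy is blocked is up to you.

```python
import time
from collections import defaultdict


def subnet_key(proxy: str) -> str:
    """Return the first three octets of an 'ip:port' proxy, e.g. '203.0.113'."""
    host = proxy.rsplit(":", 1)[0]
    return ".".join(host.split(".")[:3])


class ProxyPool:
    """Hypothetical helper: rotate across subnets, rest proxies that got blocked."""

    def __init__(self, proxies, cooldown=3600.0, clock=time.monotonic):
        self._clock = clock
        self._cooldown = cooldown
        # proxy -> earliest time it may be used again
        self._ready_at = {p: 0.0 for p in proxies}
        # subnet -> time a proxy from that subnet was last handed out
        self._last_used = defaultdict(float)

    def get(self):
        """Return an available proxy from the least recently used subnet, or None."""
        now = self._clock()
        available = [p for p, ready in self._ready_at.items() if ready <= now]
        if not available:
            return None
        proxy = min(available, key=lambda p: self._last_used[subnet_key(p)])
        self._last_used[subnet_key(proxy)] = now
        return proxy

    def mark_blocked(self, proxy):
        """Take a proxy out of rotation for the cooldown period."""
        self._ready_at[proxy] = self._clock() + self._cooldown
```

Call `pool.get()` before each request and `pool.mark_blocked(proxy)` when the response looks like a block page. Grouping by the first three octets is only a rough stand-in for "proxies that differ only in the last section".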
As for libraries, choose them according to your needs, based on the volume of requests you have to send. The simplest options are requests, urllib, pycurl and multicurl. They make sense for single-threaded, synchronous parsers, but you will write almost everything by hand. If you want a little more power and convenience, look at Grab. It can do a lot, including convenient proxy handling. If you need high volume and speed, use Scrapy. It's a great tool, but it has its own rules. If you need to tailor it to your needs, there is plenty of information about it online.
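For the simplest case (requests, single-threaded and synchronous), the "written by hand" part might look roughly like the sketch below. The `USER_AGENTS` values are placeholders, and `request_settings` and `fetch` are hypothetical helper names, not part of requests; only `requests.get` with its `headers`, `proxies` and `timeout` arguments comes from the library. The 15-second timeout is an arbitrary choice.

```python
import random

# Placeholder values; substitute real browser User-Agent strings.
USER_AGENTS = [
    "ExampleBrowser/1.0 (placeholder)",
    "ExampleBrowser/2.0 (placeholder)",
]


def request_settings(proxy, rng=random):
    """Build keyword arguments for requests.get: proxy, random User-Agent, timeout."""
    proxy_url = f"http://{proxy}"  # assumes a plain HTTP proxy given as 'ip:port'
    return {
        "headers": {"User-Agent": rng.choice(USER_AGENTS)},
        "proxies": {"http": proxy_url, "https": proxy_url},
        "timeout": 15,
    }


def fetch(url, proxy):
    """Download one page through the given proxy and return its HTML."""
    import requests  # third-party; imported here so the helper above has no dependency

    response = requests.get(url, **request_settings(proxy))
    response.raise_for_status()  # raise on 4xx/5xx, which often signals a block
    return response.text
```

The returned HTML can then go to BeautifulSoup with the lxml parser, as in the question. Grab and Scrapy wrap this same plumbing for you.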
You can and should work with the Amazon API. But there are several problems:
1. There is a limit on the number of requests ( more details here , but you can send up to 10 ASINs in one request).
2. The most unpleasant one: for some products (when using lookup methods) the information is missing or differs from what the website shows. In other words, do not rely on the API to return data identical to the site.
3. There is a limit on the number of products returned (when using search methods): 100 items. Beyond that, only scraping helps. This restriction applies not only to Amazon but to eBay as well. Without it, the number of dropshippers and other intermediaries would simply go off the scale.
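Since each lookup request can carry up to 10 ASINs, it pays to batch them before calling the API. A minimal sketch, assuming you already have the ASINs as a list of strings; `batch_asins` is a hypothetical helper, and how each batch is actually passed to the API depends on the client you use.

```python
from itertools import islice

MAX_ASINS_PER_REQUEST = 10  # per-request limit mentioned in point 1 above


def batch_asins(asins, size=MAX_ASINS_PER_REQUEST):
    """Yield lists of at most `size` unique ASINs, preserving input order."""
    unique = iter(dict.fromkeys(asins))  # drop duplicates without reordering
    while batch := list(islice(unique, size)):
        yield batch
```

With 23 distinct ASINs this yields three batches (10, 10 and 3 items), so three requests instead of 23 count against the limit.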
A few nuances:
- Do not try to impersonate Googlebot. Nothing good will come of it and you will just waste your time.
- Browser technologies like PhantomJS or even Selenium make no sense here. On top of the IP problem you get cookies and so on, and they are too slow for large volumes.
- The main thing, as is already clear, is to get past the system that detects bots and crawlers. So improvise, experiment, think for yourself and look for your own solutions. There are people on the other end too :) There are plenty of tips online about this (you can start from the last section here ).

warnerbrowsers, 2017-10-04
@warnerbrowsers

Here is an example of scraping Amazon with Scrapy; maybe it will come in handy.
blog.datahut.co/tutorial-how-to-scrape-amazon-usin...

Evgeny, 2020-04-24
Moykin @moykin_e

A-parser is a cool solution for Amazon parsing.
