Which Python framework to choose for a web scraping system?
I am developing a system for continuously scraping websites. At the initial stage there will be several dozen sites; later, hundreds.
1. Key features of the system that complicate the choice of a framework, inclining me to reinvent the wheel and write my own
I liked pyspider's interface and used it for a very long time, until I properly tried out all the parts of Scrapy.
It seems to me that you have not understood the topic; start over.
None of these features complicate the choice of a framework, because none of them are, or should be, covered by its functionality.
Any scraping framework is a fishing rod; the fisherman you have to write yourself. It is not the fishing rod's job to decide how often and on what schedule to launch, where to store what you have collected, and everything else. For scraping purposes you have only one question to answer: do you need to execute JS or not? If not, your choice is BeautifulSoup, because it is very fast. If yes, look towards Selenium.
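For the static case, a minimal sketch of the BeautifulSoup route might look like this (the URL is just a placeholder):

```python
# Minimal sketch: fetch a static page with requests and pull out
# the links with BeautifulSoup. The URL is a placeholder.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")

# Extract the href of every anchor tag on the page
links = [a.get("href") for a in soup.find_all("a")]
print(links)
```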
Scrapy
Cons:
* Difficulty installing on Windows. So once you do manage to install it, document the installation process.
* I had problems with encodings, though it may well have been my own fault. Pay attention to this.
Pros:
* Known to many
* Structured
* A lot of information on it
Writing your own framework from scratch is quite a difficult task. I myself have participated in developing one Perl framework, two in Python, one in Ruby, and one more in Go (all proprietary) :) However, it gives you the opportunity to build any architecture to suit your needs. This makes sense when the volumes are large - hundreds or thousands of parsers - and the architecture of existing frameworks does not fit.
Points 3 and 4 do not contradict each other in any way: you store the data centrally in a database and launch tasks in a distributed fashion through a task management system (the workers that launch parsers can live on different hosts). Proxies are mandatory regardless of the degree of distribution.
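As an illustration only, such distributed launches could be sketched with something like Celery; the broker URL and the `parse_site` body here are my own assumptions, not part of the original answer:

```python
# Sketch of distributing parser runs with Celery; the broker URL
# and the parse_site body are illustrative assumptions.
from celery import Celery

app = Celery("scrapers", broker="amqp://guest@localhost//")

@app.task
def parse_site(url):
    # Hypothetical entry point: fetch the site, parse it, and write
    # the results to the central database.
    ...

# Workers on any host consume these tasks; the scheduler just calls:
# parse_site.delay("https://example.com")
```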
As for the desire to run only the parsing part, I'm not sure this is possible out of the box, but I can suggest a workaround: write two scrapers - one a crawler that fetches and saves pages, the other a parser that parses the locally saved pages.
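A rough sketch of that two-scraper split, with all names and paths purely illustrative:

```python
# Stage 1: a Scrapy spider that only fetches and saves raw pages.
import hashlib
import pathlib
import scrapy

class FetchSpider(scrapy.Spider):
    name = "fetcher"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Save the raw body to disk; parsing happens later, offline
        pages = pathlib.Path("pages")
        pages.mkdir(exist_ok=True)
        fname = hashlib.md5(response.url.encode()).hexdigest()
        (pages / f"{fname}.html").write_bytes(response.body)

def parse_local_pages():
    # Stage 2: parse the saved pages offline, no network involved
    from bs4 import BeautifulSoup
    for path in pathlib.Path("pages").glob("*.html"):
        soup = BeautifulSoup(path.read_bytes(), "html.parser")
        ...  # extract whatever you need
```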
You can try using Scrapy as a fetcher and then throw the raw pages into some kind of queue like RabbitMQ or Kafka.
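Sketched very roughly with pika (the queue name and connection settings are assumptions; a real spider would also handle connection teardown):

```python
# Sketch: Scrapy fetches, raw pages go into RabbitMQ via pika.
# Queue name and connection settings are illustrative assumptions.
import pika
import scrapy

class QueueSpider(scrapy.Spider):
    name = "queue_fetcher"
    start_urls = ["https://example.com"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # One connection for the spider's lifetime
        self.conn = pika.BlockingConnection(
            pika.ConnectionParameters("localhost"))
        self.channel = self.conn.channel()
        self.channel.queue_declare(queue="raw_pages")

    def parse(self, response):
        # Hand the raw HTML to the queue; consumers parse it independently
        self.channel.basic_publish(
            exchange="", routing_key="raw_pages", body=response.body)
```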
The good thing about Scrapy is that it's very modular (at least it was when I last used it). If you don't like the built-in queue scheduler, you replace it with your own. If you don't like how it handles headers / proxies / caching, you add your own middleware.
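For example, a custom downloader middleware is just a class with a `process_request` hook, wired up via `DOWNLOADER_MIDDLEWARES`; the header value and proxy address below are placeholders:

```python
# Sketch of a custom Scrapy downloader middleware that rewrites the
# User-Agent and routes requests through a proxy (placeholders).
class CustomProxyMiddleware:
    def process_request(self, request, spider):
        request.headers["User-Agent"] = "my-scraper/1.0"
        request.meta["proxy"] = "http://proxy.example.com:8080"
        return None  # None means: continue normal download handling

# Enabled in settings.py, e.g.:
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.CustomProxyMiddleware": 543,
# }
```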
The main complaint in my case was the single-threadedness and the complexity of Twisted. When we started running into performance limits, we simply rewrote it in Erlang. But overall I liked the experience.