Python
danSamara, 2018-03-13 13:41:51

Which Python framework to choose for a web scraping system?

I am developing a system for continuously scraping websites. At the initial stage there will be a few dozen sites; later, hundreds.
1. Key features of the system that complicate the choice of framework and push me towards reinventing the wheel with my own solution

  1. A system "core" that allows spiders to be dynamically plugged in and unplugged without restarting the whole service, with the ability to monitor how those spiders are running. Alternatively, an API onto which a front end can later be bolted.
  2. Splitting each spider into a "fetcher" (crawling listings, downloading "raw" HTML documents and saving them to the database) and a "parser" (converting raw pages into structured data), with the ability to run the "parser" on its own. This is one of the main features: the parsing requirements may change, and then all documents of a site, of which there may be hundreds of thousands, will have to be re-parsed - without re-downloading them, of course. (See the sketch after this list.)
  3. Centralized storage of "raw" and parsed data.
  4. Distribution - the ability to run spiders on separate nodes. Since this greatly complicates the previous requirement, it can be relaxed to simply supporting proxies.
  5. Scheduling - time-based launches (every hour, day, ...) of both whole spiders and specific tasks within them, including tasks marked as one-time. Example: a site has a sitemap.xml containing links to other sitemaps: sitemap-2016.xml, sitemap-2017.xml, sitemap-2018.xml. Obviously, one fetcher pass is enough for 2016 and 2017, while 2018 has to be revisited periodically, say once a day.
  6. Priorities - the ability to specify a priority for an individual spider.
  7. Caching - support for Cache-Control headers and manual setting: do not cache / cache temporarily / cache by Cache-Control header
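
To illustrate point 2, here is a minimal sketch of what I mean by the fetcher/parser split. The use of sqlite3 and requests, the table schema and the function names are just placeholders, not tied to any of the frameworks discussed below:

```python
# Rough sketch of the fetcher/parser split: raw HTML is downloaded once
# and stored; parsing can be re-run at any time without re-downloading.
# sqlite3, requests, the schema and the names are illustrative only.
import sqlite3
import requests

def init_db(path="pages.db"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS raw_pages (
                      url TEXT PRIMARY KEY,
                      html TEXT,
                      fetched_at TEXT DEFAULT CURRENT_TIMESTAMP)""")
    return db

def fetch_site(db, urls):
    """Fetcher: download raw HTML and keep it in the database."""
    for url in urls:
        resp = requests.get(url, timeout=30)
        db.execute("INSERT OR REPLACE INTO raw_pages (url, html) VALUES (?, ?)",
                   (url, resp.text))
    db.commit()

def parse_site(db, extract):
    """Parser: re-run extraction over stored pages, no network access."""
    for url, html in db.execute("SELECT url, html FROM raw_pages"):
        yield url, extract(html)
```

When the extraction rules change, only parse_site has to be re-run; the stored HTML stays untouched.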

2. Non-critical wish list items that I could give up
  1. Using asyncio. This part of the Python standard library has settled down by now and, in my opinion, is slowly becoming the de facto standard for asynchronous programming in Python. (A minimal fetch sketch follows this list.)
  2. Simple spider deployment: drop a new spider onto the server, open the "admin panel" on the site, switch it on. Watch the results, watch the logs. Lower its priority. Disable it.
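
As for the asyncio wish, this is roughly the kind of fetch loop I have in mind; aiohttp and the URL are purely illustrative assumptions:

```python
# Minimal asyncio fetch sketch; aiohttp and the example URL are assumptions.
import asyncio
import aiohttp

async def fetch(session, url):
    # Download one page and return it together with its URL.
    async with session.get(url) as resp:
        return url, await resp.text()

async def fetch_all(urls):
    # Fetch all URLs concurrently within one client session.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

if __name__ == "__main__":
    pages = asyncio.run(fetch_all(["https://example.com/"]))
```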

3. Tools that I have reviewed so far
In practice the choice is not large and boils down to three main options:
  1. PySpider
    A great tool that I often use when I need to pull data from a site. Unfortunately, a few of its traits make it problematic here:
    • It is not clear how to split it into a "fetcher" and a "parser", or rather, how to launch only the "parser" on its own.
    • No native proxy support
    • The spider code is edited in the web interface and is not meant to be loaded from a file. A separate PySpider instance can be launched from a file, but that option does not fit here: hundreds of PySpider instances defeat the whole philosophy of the framework. It is also unclear how to debug all of this conveniently in an IDE.

  2. Grab, more specifically Grab:spider
    A very interesting lightweight framework, but it seems to be missing:
    • Ability to centrally manage spiders
    • Schedule for launching individual tasks
    • The ability to launch only the "parser" separately.

  3. Scrapy
    The most famous and most heavily promoted tool for writing spiders in Python. My concerns about it are roughly the same as with Grab:spider:
    • Schedule for launching individual tasks
    • The ability to launch only the "parser" separately.

So, what would you advise: write my own solution or try to use an existing framework?


5 answers
Dimonchik, 2018-03-14
@dimonchik2013

I liked PySpider's interface for a very long time, until I tried all the parts of Scrapy.

iSergios, 2018-03-23
@iSergios

It seems to me that you have not understood the topic, start over.
None of these features complicate the choice of a framework, because none of them are covered and should not be covered by its functionality.
Any scraping framework is just a fishing rod; the fisherman is something you have to write yourself. It is not the rod's job to decide how often and at what intervals to run, where to store what has been collected, and all the rest. For the scraping itself you only have one question: do you need to execute JS or not? If not, your choice is BeautifulSoup, because it is very fast. If yes, look towards Selenium.
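
For the static-HTML case, a minimal requests + BeautifulSoup sketch; the URL and the CSS selector here are made-up examples:

```python
# Static-HTML route: download with requests, extract with BeautifulSoup.
# The URL and the selector are illustrative placeholders.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/", timeout=30).text
soup = BeautifulSoup(html, "html.parser")
titles = [a.get_text(strip=True) for a in soup.select("h2 a")]
```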

Dmitry, 2018-03-13
@EvilsInterrupt

Scrapy
Of the minuses:
* Tricky installation on Windows. So once you do manage to install it, document the installation process.
* I had problems with encodings, although it is quite possible the problem was on my end. Pay attention to this.
From the pros:
* Well known to many
* Well structured
* A lot of information available on it

Mikhail Sisin, 2018-03-14
@JabbaHotep

Writing your own framework from scratch is quite a difficult task. I myself have participated in developing one Perl framework, two in Python, one in Ruby and one more in Go (all proprietary) :) However, it gives you the opportunity to build whatever architecture suits your needs. This makes sense when the volumes are large - hundreds and thousands of parsers - and the architecture of existing frameworks does not fit.
Points 3 and 4 do not contradict each other in any way: you store the data centrally in a database and launch tasks in a distributed way through a task management system (the workers that launch parsers can live on different hosts). Proxies should be mandatory regardless of the degree of distribution.
Regarding the desire to run only the parsing part: I'm not sure this is possible out of the box, but I can suggest a workaround - write two scrapers, one a crawler and the second a parser that parses locally saved pages (a rough sketch follows below).
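
For the task management part, a rough sketch; Celery is just my own example here, since no specific tool is implied above, and the broker URL and task name are placeholders:

```python
# Sketch of distributed task launching: workers running this module on
# other hosts pick up the tasks. Celery, the broker URL and the names
# are illustrative assumptions only.
from celery import Celery

app = Celery("parsers", broker="amqp://guest@localhost//")

@app.task
def run_spider(site_name):
    """Launch the fetch/parse job for one site on whichever worker takes it."""
    ...  # call into the actual spider code here

# Enqueue from the scheduler / "core":
# run_spider.delay("example.com")
```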
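
Roughly what that workaround could look like in Scrapy; the spider names, file layout and the extracted field are illustrative assumptions:

```python
# Two-spider workaround: one spider only downloads and saves raw HTML,
# the other parses the saved files via file:// URLs. Names, paths and
# the extracted field are placeholders.
import pathlib
import scrapy

RAW_DIR = pathlib.Path("raw_pages")

class CrawlerSpider(scrapy.Spider):
    name = "crawler"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Save the raw body to disk; no extraction happens here.
        RAW_DIR.mkdir(exist_ok=True)
        fname = RAW_DIR / (response.url.replace("/", "_") + ".html")
        fname.write_bytes(response.body)

class ParserSpider(scrapy.Spider):
    name = "parser"

    def start_requests(self):
        # Re-parse the locally stored pages without touching the network.
        for path in RAW_DIR.glob("*.html"):
            yield scrapy.Request(f"file://{path.resolve()}", self.parse)

    def parse(self, response):
        yield {"title": response.css("title::text").get()}
```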

Sergey, 2018-03-22
@seriyPS

You can try using Scrapy as a fetcher and then throw the raw pages into some kind of queue such as RabbitMQ or Kafka.
The good thing about Scrapy is that it's very modular (at least it was when I last used it). If you don't like the built-in queue scheduler, replace it with your own. If you don't like how it works with headers / proxies / caching - you add your own middleware.
The main complaint in my case was the single-threadedness and the complexity of Twisted. When we started running into performance limits, we simply rewrote it in Erlang. But overall I liked the experience.
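
A minimal sketch of that fetcher-to-queue handoff as a Scrapy item pipeline; pika, the queue name and the item fields are my own assumptions for illustration:

```python
# Item pipeline that publishes raw pages to RabbitMQ so a separate parser
# service can consume them. pika, the queue name and the item fields
# ("url", "html") are illustrative assumptions.
import json
import pika

class RawPagePipeline:
    def open_spider(self, spider):
        self.conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        self.channel = self.conn.channel()
        self.channel.queue_declare(queue="raw_pages", durable=True)

    def process_item(self, item, spider):
        # Hand the raw page off to the queue; parsing happens elsewhere.
        self.channel.basic_publish(
            exchange="",
            routing_key="raw_pages",
            body=json.dumps({"url": item["url"], "html": item["html"]}),
        )
        return item

    def close_spider(self, spider):
        self.conn.close()
```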
