Python
JRazor, 2014-03-02 16:53:07

How to run Scrapy spiders serially or in parallel?

Hello. I have a problem: I cannot run the spiders in sequence. The culprit is the reactor; for some reason it won't restart. There is one spider per file, and the following is at the end of each file:

spider = %current spider class%()
settings = get_project_settings()
settings.overrides.update(options)
crawler = Crawler(settings)
# crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.install()
crawler.configure()
crawler.crawl(spider)
crawler.signals.connect(crawler.uninstall, signal=signals.spider_closed)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.start()
log.start(logfile=logfile, loglevel=log.DEBUG, crawler=crawler, logstdout=False)
reactor.run()

crawler._spider_closed()
print "Closed spider %spider name%"
import %next spider%

But an error pops up:
Traceback (most recent call last):
  File "C:/Users/Eugene/ODesk/450/spiders/__init__.py", line 1, in <module>
    import newenglandfilm
  File "C:\Users\Eugene\ODesk\450\spiders\newenglandfilm.py", line 60, in <module>
    import mandy
  File "C:\Users\Eugene\ODesk\450\spiders\mandy.py", line 68, in <module>
    reactor.run()
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1191, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1171, in startRunning
    ReactorBase.startRunning(self)
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 683, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable


2 answers
JRazor, 2014-03-02
@JRazor

I've since realized that the reactor must not be restarted; you run a single reactor for everything. It was resolved as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-

from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

# Import the spiders
from spiders.newenglandfilm import NewenglandFilm
from spiders.mandy import Mandy
from spiders.productionhub import ProductionHub
from spiders.craiglist import Craiglist

from spiders.my_settings import options

# Pass in the settings
settings = get_project_settings()
settings.overrides.update(options)

# Run the four spiders one after another
crawler = Crawler(settings)
crawler.configure()
crawler.crawl(NewenglandFilm())
crawler.start()

crawler = Crawler(settings)
crawler.configure()
crawler.crawl(Mandy())
crawler.start()

crawler = Crawler(settings)
crawler.configure()
crawler.crawl(ProductionHub())
crawler.start()

crawler = Crawler(settings)
crawler.configure()
crawler.crawl(Craiglist())
crawler.start()

# Start the reactor
reactor.run()
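
One caveat: as written, nothing ever stops the reactor, so the process keeps running after all four spiders finish (and connecting reactor.stop to the first spider_closed signal, as in the question, would shut the other three down early). Below is a minimal sketch of one way to handle this with the same pre-1.0 Crawler API; the ReactorControl helper is an illustrative name of my own, and the imports, spider classes, and settings from the snippet above are assumed:

from scrapy import signals
from twisted.internet import reactor

class ReactorControl(object):
    # Illustrative helper: stop the reactor only after the last spider closes
    def __init__(self):
        self.crawlers_running = 0

    def add_crawler(self):
        self.crawlers_running += 1

    def remove_crawler(self):
        self.crawlers_running -= 1
        if self.crawlers_running == 0:
            reactor.stop()

reactor_control = ReactorControl()

for spider in (NewenglandFilm(), Mandy(), ProductionHub(), Craiglist()):
    crawler = Crawler(settings)
    crawler.configure()
    # Each closed spider decrements the counter; the last one stops the reactor
    crawler.signals.connect(reactor_control.remove_crawler, signal=signals.spider_closed)
    crawler.crawl(spider)
    crawler.start()
    reactor_control.add_crawler()

reactor.run()

Note that the spiders still run concurrently inside the single reactor, as in the snippet above; this sketch only fixes the shutdown.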

Rinat Akhtamov, 2014-03-02
@rinnaatt

The reactor is the main (eternal) loop of a Twisted application, and it is not meant to be restarted.
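A minimal demonstration of this one-run-only behavior, with no Scrapy involved:

from twisted.internet import reactor

reactor.callWhenRunning(reactor.stop)
reactor.run()  # first run: starts the loop, which stops itself right away
reactor.run()  # second run: raises twisted.internet.error.ReactorNotRestartable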
To be honest, I don't understand why you need this particular way of launching spiders. If you need the spiders to run in order, you can install scrapyd and then schedule jobs for it from cron, i.e. register curl calls to certain URLs with POST parameters.
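
For example, a sketch of such a cron entry using scrapyd's schedule.json endpoint, assuming scrapyd is listening on its default port 6800 and the project has been deployed as myproject (the project name is a placeholder; the spider name is taken from the question and assumed to match the spider's name attribute):

# Schedule the spider every night at 03:00 via scrapyd
0 3 * * * curl http://localhost:6800/schedule.json -d project=myproject -d spider=newenglandfilm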
