How to implement serial or parallel launch of spiders in Scrapy?
Hello. I've run into a problem: I can't launch my spiders one after another. The culprit is the reactor, which for some reason won't restart. Each spider lives in its own file, and the following is written at the end of each file:
# (imports at the top of each file)
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from scrapy import log, signals
from twisted.internet import reactor

spider = %current spider class%()
settings = get_project_settings()
settings.overrides.update(options)
crawler = Crawler(settings)
# crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.install()
crawler.configure()
crawler.crawl(spider)
crawler.signals.connect(crawler.uninstall, signal=signals.spider_closed)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.start()
log.start(logfile=logfile, loglevel=log.DEBUG, crawler=crawler, logstdout=False)
reactor.run()
crawler._spider_closed()
print "Closed spider %spider name%"
import %next spider%

When the next spider's module is imported, its reactor.run() fails:
Traceback (most recent call last):
  File "C:/Users/Eugene/ODesk/450/spiders/__init__.py", line 1, in <module>
    import newenglandfilm
  File "C:\Users\Eugene\ODesk\450\spiders\newenglandfilm.py", line 60, in <module>
    import mandy
  File "C:\Users\Eugene\ODesk\450\spiders\mandy.py", line 68, in <module>
    reactor.run()
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1191, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1171, in startRunning
    ReactorBase.startRunning(self)
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 683, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
I have since realized that the reactor must not be restarted: you need to run a single reactor for all the spiders. Everything was resolved as follows:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

# Import the spiders
from spiders.newenglandfilm import NewenglandFilm
from spiders.mandy import Mandy
from spiders.productionhub import ProductionHub
from spiders.craiglist import Craiglist
from spiders.my_settings import options

# Pass in the settings
settings = get_project_settings()
settings.overrides.update(options)

# Launch the four spiders one after another
crawler = Crawler(settings)
crawler.configure()
crawler.crawl(NewenglandFilm())
crawler.start()

crawler = Crawler(settings)
crawler.configure()
crawler.crawl(Mandy())
crawler.start()

crawler = Crawler(settings)
crawler.configure()
crawler.crawl(ProductionHub())
crawler.start()

crawler = Crawler(settings)
crawler.configure()
crawler.crawl(Craiglist())
crawler.start()

# Start the reactor
reactor.run()
The reactor is the main (eternal) loop of a Twisted application; it is started once and is not supposed to be restarted.
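One caveat: in the script above nothing ever calls reactor.stop(), so the process keeps running even after all four spiders have closed. A minimal sketch of one way to shut it down cleanly, using the same old-style Crawler API as above (the spider_closed counter is my addition, not part of the original answer):

from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from scrapy import signals
from twisted.internet import reactor

from spiders.newenglandfilm import NewenglandFilm
from spiders.mandy import Mandy
from spiders.productionhub import ProductionHub
from spiders.craiglist import Craiglist

SPIDERS = [NewenglandFilm, Mandy, ProductionHub, Craiglist]
settings = get_project_settings()
closed = []

def spider_closed(spider):
    # Count finished spiders; stop the reactor after the last one closes
    closed.append(spider)
    if len(closed) == len(SPIDERS):
        reactor.stop()

for spider_cls in SPIDERS:
    crawler = Crawler(settings)
    crawler.signals.connect(spider_closed, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider_cls())
    crawler.start()

reactor.run()  # blocks until spider_closed() stops the reactor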
To be honest, I don't understand why you need this particular way of launching spiders. If you need the spiders to run in a fixed order, you can install scrapyd and then schedule jobs for it from cron, i.e. register curl calls to the scrapyd URLs with the appropriate POST parameters.
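For reference, a rough sketch of what that could look like, assuming scrapyd is running on its default port 6800 and the project has been deployed under a hypothetical name my_project (scrapyd's schedule.json endpoint accepts the project and spider names as POST parameters):

# Hypothetical crontab entries: schedule two of the spiders nightly
0 3 * * *  curl http://localhost:6800/schedule.json -d project=my_project -d spider=newenglandfilm
30 3 * * * curl http://localhost:6800/schedule.json -d project=my_project -d spider=mandy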