Python
JRazor, 2014-06-20 21:54:43

Scrapy: how do I get the return code from any page?

Hello. I've run into an issue I can't resolve: Scrapy stalls when you feed certain URLs to it. Here's an example:

from scrapy.spider import Spider
from scrapy.crawler import Crawler
from scrapy import signals
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from settings import options
from urlparse import urlparse

class SpiderParse(Spider):
    good_address = []
    name = 'Spider'
    domains = ['adn.com', 'dnr.state.ak.us', 'criminalrecordcheck.info', 'riverbug.terapad.com', 'ala-ism.pansitan.net']
    allowed_domains = domains
    start_urls = ['http://'+domain for domain in domains]

    def parse(self, response):
        if response.url in self.start_urls:
            self.good_address.append(urlparse(response.url).netloc)

        print self.good_address

if __name__ == '__main__':
    options = {
        'CONCURRENT_ITEMS': 200,
        'USER_AGENT': 'Googlebot/2.1 (+http://www.google.com/bot.html)',
        'DOWNLOAD_DELAY': 0.5,
        'CONCURRENT_REQUESTS': 20,
    }

    spider = SpiderParse()
    settings = get_project_settings()
    settings.overrides.update(options)
    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.install()
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    reactor.run()

If you run this script, it starts processing the addresses and then just stops after 3 or 4 of them, which is clearly not good. How can I get the response status code in every case?
PS. I read this question, but it did not help me much (or I misunderstood something): stackoverflow.com/questions/9698372/scrapy-and-res...
PPS. The domain "criminalrecordcheck.info" is not registered, yet it returns 200. Why?
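One likely reason the spider seems to "stop" is that Scrapy silently drops non-200 responses before they reach parse(); passing the handle_httpstatus_all key in the request meta (or setting a handle_httpstatus_list attribute on the spider) lets every status through so the callback can inspect response.status. The per-URL bookkeeping that the callback would then do can be sketched with plain stand-in objects (FakeResponse and record_status below are hypothetical helpers for illustration, not Scrapy API):

```python
try:
    from urlparse import urlparse        # Python 2
except ImportError:
    from urllib.parse import urlparse    # Python 3


class FakeResponse(object):
    """Minimal stand-in for scrapy.http.Response: just url and status."""
    def __init__(self, url, status):
        self.url = url
        self.status = status


def record_status(response, good_address):
    """Mimics the parse() callback: keep only hosts that answered 200,
    but return the status code for every response, whatever it is."""
    if response.status == 200:
        good_address.append(urlparse(response.url).netloc)
    return response.status


good = []
for resp in (FakeResponse('http://adn.com', 200),
             FakeResponse('http://dnr.state.ak.us', 500)):
    record_status(resp, good)
# good now holds only the hosts that returned 200
```

Inside a real spider, the same idea would mean yielding Request(url, callback=self.parse, meta={'handle_httpstatus_all': True}) from start_requests and branching on response.status in parse().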
