Scrapy

tispoint, 2016-04-19 13:21:57

Why doesn't Scrapy save data?

Good afternoon.
I'm running the spider. From the log I can see that it visits all the pages it should and doesn't go where it shouldn't, so everything looks fine, but no CSV output is produced: the file length is 0. What could be the catch?

# -*- coding: utf-8 -*-
from scrapy import Selector
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor

from intelxeon.items import IntelxeonItem


class IntelXeon_Spider(CrawlSpider):

    name = 'intelxeon_spider'
    allowed_domains = ['ark.intel.com']
    start_urls = [
        'http://ark.intel.com/products/family/78581',
        'http://ark.intel.com/products/family/78585',
    ]

    rules = [
        # Don't follow links under /content/.
        Rule(LinkExtractor(deny=('/content/',)), follow=False),
        # Links matching '/producr/' go to parse_page and are followed further.
        Rule(LinkExtractor(allow=('/producr/',)), callback='parse_page', follow=True),
    ]

    def parse_page(self, response):
        hxs = Selector(response)
        item = IntelxeonItem()
        try:
            item['url'] = response.request.url
            item['title'] = hxs.xpath('//h1/text()').extract()
        except Exception:
            item['url'] = response.request.url
        return item

I run scrapy crawl intelxeon_spider -o xeon.csv -t csv
and get silence in response :-)
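
A quick way to check whether parse_page fires at all is a log line inside the callback (a minimal sketch using the spider's built-in logger, available since Scrapy 1.0; the rest of the method stays the same):

    def parse_page(self, response):
        # If this message never shows up in the log, no link matched the
        # rule's allow pattern, so no items were ever produced.
        self.logger.info('parse_page called for %s', response.url)

If pages are being crawled but this line is absent from the log, the problem is in the rules rather than in the CSV export.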


2 answers
Dimonchik, 2016-04-19
@dimonchik2013

1) Check that everything required is actually parsed/found => see the log.
2) Check the file permissions, and specify the full path to the file (with the drive letter; all directories must already exist).
There are copyright symbols in the H1, so there may be encoding trouble; first check whether the page declares an encoding in a meta tag, for example.
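
For point 1, a minimal logging pipeline makes it easy to see whether any items are produced at all before they reach the feed exporter (a sketch; the class name DebugLogPipeline is arbitrary):

# intelxeon/pipelines.py
import logging

logger = logging.getLogger(__name__)

class DebugLogPipeline(object):
    # Log every item that passes through and hand it on unchanged.
    def process_item(self, item, spider):
        logger.info('Got item: %r', item)
        return item

Enable it in settings.py:

ITEM_PIPELINES = {'intelxeon.pipelines.DebugLogPipeline': 100}

If nothing shows up in the log, the callbacks are never returning items and an empty CSV is the expected result.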

tispoint, 2016-04-19
@tispoint

As far as I can tell, everything in the log is fine:

2016-04-19 17:12:36 [scrapy] INFO: Scrapy 1.0.1 started (bot: intelxeon)
2016-04-19 17:12:36 [scrapy] INFO: Optional features available: ssl, http11
2016-04-19 17:12:36 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'intelxeon.spiders', 'FEED_URI': 'xeon.csv', 'SPIDER_MODULES': ['intelxeon.spiders'], 'BOT_NAME': 'intelxeon', 'LOG_STDOUT': True, 'FEED_FORMAT': 'csv', 'LOG_FILE': 'C:/log.txt'}
2016-04-19 17:12:36 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-04-19 17:12:37 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-04-19 17:12:37 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-04-19 17:12:37 [scrapy] INFO: Enabled item pipelines: 
2016-04-19 17:12:37 [scrapy] INFO: Spider opened
2016-04-19 17:12:37 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-04-19 17:12:37 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-04-19 17:12:38 [scrapy] DEBUG: Crawled (200) <GET http://ark.intel.com/products/family/78581> (referer: None)
2016-04-19 17:12:39 [scrapy] DEBUG: Crawled (200) <GET http://ark.intel.com/products/family/78581> (referer: http://ark.intel.com/products/family/78581)
...
2016-04-19 17:12:40 [scrapy] INFO: Closing spider (finished)
2016-04-19 17:12:40 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 16136,
 'downloader/request_count': 51,
 'downloader/request_method_count/GET': 51,
 'downloader/response_bytes': 697437,
 'downloader/response_count': 51,
 'downloader/response_status_count/200': 51,
 'dupefilter/filtered': 8,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 4, 19, 13, 12, 40, 99000),
 'log_count/DEBUG': 53,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 51,
 'scheduler/dequeued': 51,
 'scheduler/dequeued/memory': 51,
 'scheduler/enqueued': 51,
 'scheduler/enqueued/memory': 51,

If I specify -o c:\xeon.csv on the command line, then the following error appears:
2016-04-19 17:14:33 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'intelxeon.spiders', 'FEED_URI': 'c:\xeon.csv', 'SPIDER_MODULES': ['intelxeon.spiders'], 'BOT_NAME': 'intelxeon', 'LOG_STDOUT': True, 'FEED_FORMAT': 'csv', 'LOG_FILE' : 'C:/log.txt'}
2016-04-19 17:14:33 [scrapy] ERROR: Unknown feed storage scheme: c
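
That error is explainable: Scrapy treats the -o argument as a URI, so in c:\xeon.csv everything before the colon is parsed as the URI scheme, leaving the unknown scheme "c". A relative path, or an explicit file:// URI, should avoid this (assuming the target directory exists and is writable):

scrapy crawl intelxeon_spider -o file:///c:/xeon.csv -t csv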
As for the fields: I removed everything for the purity of the experiment, leaving only

item['url'] = response.request.url

Shouldn't that be written out in any case?
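
Another way to test the rule/callback wiring directly, without involving the feed export, is Scrapy's built-in parse command, which reports which callback the CrawlSpider rules select for a given URL and which items come back (the product URL below is a placeholder; any product page reached by the crawl will do):

scrapy parse --spider=intelxeon_spider --rules http://ark.intel.com/products/some-product

If it selects no callback for a real product URL, that points at the allow/deny patterns in the rules.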
