Why doesn't Scrapy save data?
Good afternoon.
I'm running the script below. Judging by the output, it visits all the pages it should and stays away from the ones it shouldn't - everything seems fine, yet nothing ends up in the CSV: the file length is 0. What could be the catch?
# -*- coding: utf-8 -*-
from scrapy.spiders import Rule, CrawlSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from intelxeon.items import IntelxeonItem
from scrapy import Selector


class IntelXeon_Spider(CrawlSpider):
    name = 'intelxeon_spider'
    allowed_domains = ['ark.intel.com']
    start_urls = [
        'http://ark.intel.com/products/family/78581',
        'http://ark.intel.com/products/family/78585',
    ]
    rules = [
        Rule(
            LinkExtractor(deny=('/content/',)),
            follow=False,
        ),
        Rule(
            LinkExtractor(allow=('/producr/',)),
            callback='parse_page',
            follow=True,
        ),
    ]

    def parse_page(self, response):
        hxs = Selector(response)
        item = IntelxeonItem()
        try:
            item['url'] = response.request.url
            item['title'] = hxs.xpath('//h1/text()').extract()
        except:
            item['url'] = response.request.url
        return item
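Judging by the FEED_URI and FEED_FORMAT overrides visible in the log further down, the CSV feed is enabled at launch; an equivalent command line (assumed - the actual invocation isn't shown in the question) would be:

scrapy crawl intelxeon_spider -o xeon.csv -t csv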
1) Check that everything you need is actually being parsed/found => see the log.
2) Check the file permissions, and specify the output path explicitly (with the drive letter; every directory on the path must already exist) - see the settings sketch below.
Also, the H1 headings contain copyright symbols, so there may be encoding trouble - for a start, check whether the encoding is declared in some meta tag, for example (see the parse_page sketch below).
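For point 2, a minimal sketch of pinning the feed export down in settings.py - the path itself is an example, not taken from the question, and the target directory has to exist beforehand:

# settings.py - feed export pinned to an explicit location
# (the path below is an example; adjust it to an existing directory)
FEED_FORMAT = 'csv'
# the file:// URI form keeps the drive letter from being parsed as a URL scheme
FEED_URI = 'file:///C:/scrapy/out/xeon.csv'

And for the encoding point, a hedged sketch of a drop-in parse_page for the spider above (which trademark marks actually occur in the H1 is an assumption):

    def parse_page(self, response):
        item = IntelxeonItem()
        item['url'] = response.url
        # extract_first() gives a plain string rather than a one-element list,
        # which exports to CSV more cleanly
        title = response.xpath('//h1/text()').extract_first(default=u'')
        # strip the (R)/(TM) marks before they reach the exporter
        item['title'] = title.replace(u'\xae', u'').replace(u'\u2122', u'').strip()
        return item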
As far as I can tell, everything in our log looks fine:
2016-04-19 17:12:36 [scrapy] INFO: Scrapy 1.0.1 started (bot: intelxeon)
2016-04-19 17:12:36 [scrapy] INFO: Optional features available: ssl, http11
2016-04-19 17:12:36 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'intelxeon.spiders', 'FEED_URI': 'xeon.csv', 'SPIDER_MODULES': ['intelxeon.spiders'], 'BOT_NAME': 'intelxeon', 'LOG_STDOUT': True, 'FEED_FORMAT': 'csv', 'LOG_FILE': 'C:/log.txt'}
2016-04-19 17:12:36 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-04-19 17:12:37 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-04-19 17:12:37 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-04-19 17:12:37 [scrapy] INFO: Enabled item pipelines:
2016-04-19 17:12:37 [scrapy] INFO: Spider opened
2016-04-19 17:12:37 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-04-19 17:12:37 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-04-19 17:12:38 [scrapy] DEBUG: Crawled (200) <GET http://ark.intel.com/products/family/78581> (referer: None)
2016-04-19 17:12:39 [scrapy] DEBUG: Crawled (200) <GET http://ark.intel.com/products/family/78581> (referer: http://ark.intel.com/products/family/78581)
...
2016-04-19 17:12:40 [scrapy] INFO: Closing spider (finished)
2016-04-19 17:12:40 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 16136,
'downloader/request_count': 51,
'downloader/request_method_count/GET': 51,
'downloader/response_bytes': 697437,
'downloader/response_count': 51,
'downloader/response_status_count/200': 51,
'dupefilter/filtered': 8,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 4, 19, 13, 12, 40, 99000),
'log_count/DEBUG': 53,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 51,
'scheduler/dequeued': 51,
'scheduler/dequeued/memory': 51,
'scheduler/enqueued': 51,
'scheduler/enqueued/memory': 51,