Scrapy - multiple pages into a single item
Hello,
I need to build an item whose main information comes from the main response, and then add extra fields whose data lives at external links, e.g. .js files or other URLs.
Here is what I have managed after five hours of reading the documentation and Stack Overflow: the Google main page is parsed, and a page_size list is added to the item, mapping each URL found on google.com to its size in bytes.
{'page_size': [{'http://support.google.com/accounts/?hl=ru': 50526}]}
{'page_size': [{'http://support.google.com/accounts/?hl=ru': 50526},
{'http://www.google.com/intl/ru/policies/privacy/': 37644}]}
import leaf
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# TestProduct is the project's scrapy.Item with a 'page_size' field
# (import path assumed).
from myproject.items import TestProduct


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['google.com', 'google.ru']
    start_urls = ['https://www.google.com/']

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'https://www.google')),
             callback='parse_item',
             follow=False),
    )

    def get_page_size(self, response):
        item = response.meta['item']
        if 'page_size' not in item:
            item['page_size'] = list()
        # Record the fetched page's URL and its size in bytes.
        item['page_size'].append({response.url: len(response.body)})
        # The item is yielded once per linked page, hence the partial
        # results shown above.
        yield item

    def parse_item(self, response):
        item = TestProduct()
        doc = leaf.parse(response.body)
        for url in doc('a'):
            if 'href' not in url.attrib:
                continue
            url = url.attrib['href']
            if url.find('http:') != 0:
                continue
            # One request per external link; each carries the same item
            # in its meta.
            request = Request(url, callback=self.get_page_size)
            request.meta['item'] = item
            yield request
You could do it like this: issue the size-fetching requests one at a time in a chain - each get_page_size schedules the next get_page_size while there is still something in the queue, and the last callback in the chain yields the item. The "queue" (the list of remaining URLs) can be passed along via meta.
I'm not an expert - perhaps there are better options.
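A rough, untested sketch of that chaining idea, reusing the spider from the question (the items import path is an assumption, everything else follows the code above):

import leaf
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from myproject.items import TestProduct  # assumed import path


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['google.com', 'google.ru']
    start_urls = ['https://www.google.com/']

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'https://www.google')),
             callback='parse_item',
             follow=False),
    )

    def parse_item(self, response):
        item = TestProduct()
        item['page_size'] = []

        doc = leaf.parse(response.body)
        # Collect all external http links into a "queue".
        urls = [a.attrib['href'] for a in doc('a')
                if 'href' in a.attrib and a.attrib['href'].startswith('http:')]

        if not urls:
            yield item
            return

        # Start the chain: request the first URL, pass the rest via meta.
        yield Request(urls[0], callback=self.get_page_size,
                      meta={'item': item, 'urls': urls[1:]})

    def get_page_size(self, response):
        item = response.meta['item']
        item['page_size'].append({response.url: len(response.body)})

        urls = response.meta['urls']
        if urls:
            # Something is still in the queue: request the next URL and
            # keep passing the item and the remaining queue along.
            yield Request(urls[0], callback=self.get_page_size,
                          meta={'item': item, 'urls': urls[1:]})
        else:
            # Last request in the chain: the item is now complete, so it
            # is yielded exactly once.
            yield item

This trades parallelism for correctness - the pages are fetched sequentially, but the item is only emitted after the last response arrives.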