Scrapy - multiple pages into a single item
Hello,
I need to build an item whose main information comes from the main response, and then add extra fields whose data lives at external links, e.g. .js files or other URLs.
Here is what I have managed after five hours of reading the documentation and Stack Overflow: the Google main page is parsed, and a page_size list is added to the item, mapping each URL found on google.com to its size in bytes.
{'page_size': [{'http://support.google.com/accounts/?hl=ru': 50526}]}
{'page_size': [{'http://support.google.com/accounts/?hl=ru': 50526},
{'http://www.google.com/intl/ru/policies/privacy/': 37644}]}
import leaf
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# TestProduct is the project's scrapy.Item with a 'page_size' field
# (import path assumed).
from myproject.items import TestProduct


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['google.com', 'google.ru']
    start_urls = ['https://www.google.com/']

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'https://www.google')),
             callback='parse_item',
             follow=False),
    )

    def get_page_size(self, response):
        item = response.meta['item']
        if 'page_size' not in item:
            item['page_size'] = list()
        # Record the fetched page's URL and its size in bytes.
        item['page_size'].append({response.url: len(response.body)})
        # The item is yielded once per linked page, hence the partial
        # results shown above.
        yield item

    def parse_item(self, response):
        item = TestProduct()
        doc = leaf.parse(response.body)
        for url in doc('a'):
            if 'href' not in url.attrib:
                continue
            url = url.attrib['href']
            if url.find('http:') != 0:
                continue
            # One request per external link; each carries the same item
            # in its meta.
            request = Request(url, callback=self.get_page_size)
            request.meta['item'] = item
            yield request
You could do it like this: issue the size-fetching requests one at a time in a chain - each get_page_size schedules the next get_page_size while there is still something in the queue, and the last callback in the chain yields the item. The "queue" (the list of remaining URLs) can be passed along via meta.
I'm not an expert - perhaps there are better options.
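A rough, untested sketch of that chaining idea, reusing the spider from the question (the items import path is an assumption, everything else follows the code above):

import leaf
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from myproject.items import TestProduct  # assumed import path


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['google.com', 'google.ru']
    start_urls = ['https://www.google.com/']

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'https://www.google')),
             callback='parse_item',
             follow=False),
    )

    def parse_item(self, response):
        item = TestProduct()
        item['page_size'] = []

        doc = leaf.parse(response.body)
        # Collect all external http links into a "queue".
        urls = [a.attrib['href'] for a in doc('a')
                if 'href' in a.attrib and a.attrib['href'].startswith('http:')]

        if not urls:
            yield item
            return

        # Start the chain: request the first URL, pass the rest via meta.
        yield Request(urls[0], callback=self.get_page_size,
                      meta={'item': item, 'urls': urls[1:]})

    def get_page_size(self, response):
        item = response.meta['item']
        item['page_size'].append({response.url: len(response.body)})

        urls = response.meta['urls']
        if urls:
            # Something is still in the queue: request the next URL and
            # keep passing the item and the remaining queue along.
            yield Request(urls[0], callback=self.get_page_size,
                          meta={'item': item, 'urls': urls[1:]})
        else:
            # Last request in the chain: the item is now complete, so it
            # is yielded exactly once.
            yield item

This trades parallelism for correctness - the pages are fetched sequentially, but the item is only emitted after the last response arrives.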