How do I solve this problem when parsing sites with the Scrapy framework?
Good afternoon! I am solving the following problem: there is a news archive, for example www.fontanka.ru/fontanka/arc/news.html, and I need to extract all the articles for all time and write them to a database. With the Scrapy shell I managed to work this out, but I couldn't write the spider itself.
When I'm working in the shell, part of the program looks like this:
n = 0  # number of articles
data = "/2013/02/13"
while n <= 10000:
    fetch(site + data + "/news.html")
    list_site = sel.xpath('//a[contains(@class, pattern)]/@href')
    for i in list_site:
        # extract the contents of i
        # write it to the database
        n = n + 1
    data = # pick the next date
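The only piece left open in this loop is stepping to the next date. A minimal sketch of that, assuming the archive uses /YYYY/MM/DD paths and that we walk backwards one day at a time (both are my assumptions, not something stated in the question), could look like this:

from datetime import date, timedelta

# Assumed starting point and direction: one day back per iteration.
current = date(2013, 2, 13)
step = timedelta(days=1)

for _ in range(10):
    # Produces strings like "/2013/02/13", matching the archive URL layout.
    data = current.strftime("/%Y/%m/%d")
    print(data)
    current -= step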
I would do it something like this:
from scrapy.spiders import CrawlSpider
from scrapy.http import Request


class my_spider(CrawlSpider):
    name = "fontanka"
    allowed_domains = ["fontanka.ru"]
    start_urls = ["http://www.fontanka.ru/fontanka/arc/news.html"]
    base_address = "http://www.fontanka.ru/"

    def parse(self, response):
        """
        Parse the start page and build links to the other pages.
        """
        for date in (date1, ....):
            # pick the next date
            site = self.base_address
            url = "%(site)s/%(date)s/news.html" % {
                "site": site,
                "date": date,
            }
            request = Request(url, callback=self.parse_page)
            yield request

    def parse_page(self, response):
        """
        Parse each downloaded page.
        """
        list_site = response.xpath('//a[contains(@class, pattern)]/@href')
        for i in list_site:
            # extract the contents of i
            # write to the database here or in pipelines
            # http://doc.scrapy.org/en/latest/topics/item-pipeline.html
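Since the answer points to the item pipeline documentation for the database part, here is a rough sketch of what that storage side could look like. The item fields, the SQLitePipeline class and the articles.db file are my own assumptions for illustration, not part of the original answer:

import sqlite3
import scrapy


class ArticleItem(scrapy.Item):
    # Hypothetical fields; adjust to whatever you actually extract.
    url = scrapy.Field()
    title = scrapy.Field()
    text = scrapy.Field()


class SQLitePipeline(object):
    """Writes every yielded item into a local SQLite database."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect("articles.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS articles (url TEXT, title TEXT, text TEXT)"
        )

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT INTO articles VALUES (?, ?, ?)",
            (item.get("url"), item.get("title"), item.get("text")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

With something like this, parse_page would yield ArticleItem objects instead of writing to the database directly, and the pipeline would be enabled through the ITEM_PIPELINES setting in settings.py, as described at the link above.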