How do I solve this problem when parsing sites with the Scrapy framework?
Good afternoon! I am solving the following problem: there is a news archive, for example www.fontanka.ru/fontanka/arc/news.html, and I need to extract all the articles for all time and write them to a database. With the Scrapy shell I managed to work this out, but I couldn't write the spider itself.
When I'm working in the shell, part of the program looks like this:
n = 0  # number of articles
data = "/2013/02/13"
while n <= 10000:
    fetch(site + data + "/news.html")
    list_site = sel.xpath('//a[contains(@class, pattern)]/@href')
    for i in list_site:
        # extract the contents of i
        # write it to the database
        n = n + 1
    data = # pick the next date
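The only piece left open in this loop is stepping to the next date. A minimal sketch of that, assuming the archive uses /YYYY/MM/DD paths and that we walk backwards one day at a time (both are my assumptions, not something stated in the question), could look like this:

from datetime import date, timedelta

# Assumed starting point and direction: one day back per iteration.
current = date(2013, 2, 13)
step = timedelta(days=1)

for _ in range(10):
    # Produces strings like "/2013/02/13", matching the archive URL layout.
    data = current.strftime("/%Y/%m/%d")
    print(data)
    current -= step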
I would do it something like this:
from scrapy.spiders import CrawlSpider
from scrapy.http import Request


class my_spider(CrawlSpider):
    name = "fontanka"
    allowed_domains = ["fontanka.ru"]
    start_urls = ["http://www.fontanka.ru/fontanka/arc/news.html"]
    base_address = "http://www.fontanka.ru/"

    def parse(self, response):
        """
        Parse the start page and build links to the other pages.
        """
        for date in (date1, ....):
            # pick the next date
            site = self.base_address
            url = "%(site)s/%(date)s/news.html" % {
                "site": site,
                "date": date,
            }
            request = Request(url, callback=self.parse_page)
            yield request

    def parse_page(self, response):
        """
        Parse each downloaded page.
        """
        list_site = response.xpath('//a[contains(@class, pattern)]/@href')
        for i in list_site:
            # extract the contents of i
            # write to the database here or in pipelines
            # http://doc.scrapy.org/en/latest/topics/item-pipeline.html
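Since the answer points to the item pipeline documentation for the database part, here is a rough sketch of what that storage side could look like. The item fields, the SQLitePipeline class and the articles.db file are my own assumptions for illustration, not part of the original answer:

import sqlite3
import scrapy


class ArticleItem(scrapy.Item):
    # Hypothetical fields; adjust to whatever you actually extract.
    url = scrapy.Field()
    title = scrapy.Field()
    text = scrapy.Field()


class SQLitePipeline(object):
    """Writes every yielded item into a local SQLite database."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect("articles.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS articles (url TEXT, title TEXT, text TEXT)"
        )

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT INTO articles VALUES (?, ?, ?)",
            (item.get("url"), item.get("title"), item.get("text")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

With something like this, parse_page would yield ArticleItem objects instead of writing to the database directly, and the pipeline would be enabled through the ITEM_PIPELINES setting in settings.py, as described at the link above.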