PHP
Igor Tkachenko, 2014-06-11 11:32:17

How to parse news from news.yandex and mail.ru in PHP?

How can I parse news from news.yandex.ru for a given search query, for example:

http://news.yandex.ru/yandsearch?grhow=clutop&text=какой то запрос&rpt=nnews2&p=0

The trouble is that Yandex asks for a captcha when requests are frequent. What can I do about it, if anything?
Solutions that use mail.ru instead will also do. I tried using curl and parsing with regular expressions.

4 answer(s)
xfenix, 2014-06-11
@xfenix

I recently wrote a spider for Yandex news with scrapy, and I use it quite successfully.
My volumes are small, so I settled on entering the captcha by hand:

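# Fragment of a scrapy spider callback: `response` and `self` come from the spider,
# and CAPTCHA_FILE is a path constant defined elsewhere in the module.
import os
import urllib.request

from lxml import html
from PIL import Image

body = html.fromstring(response.body)
# extract the captcha image URL and the hidden form fields
captcha = body.xpath('//*[@class="b-captcha__image"]/@src')[0]
key = body.xpath('//input[contains(@name, "key")]/@value')[0]
returl = body.xpath('//input[contains(@name, "retpath")]/@value')[0]
self.retpath = returl
# remove any previously downloaded captcha, then download the new one
try:
    os.remove(CAPTCHA_FILE)
except OSError:
    pass
urllib.request.urlretrieve(captcha, CAPTCHA_FILE)
# show the captcha to the operator
img = Image.open(CAPTCHA_FILE)
img.show()
# read the answer typed in by hand
captcha_value = input('Put captcha in manually> ')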

Push Pull, 2014-06-11
@deadbyelpy

How frequent do the requests need to be? Cache the data on your side, refresh it every 5 minutes, and you will be happy.
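
A minimal sketch of that caching idea, in Python to match the spider snippet in this thread. The `TTLCache` class and its `fetch` and `clock` parameters are illustrative names, not part of any library; `fetch` stands for whatever function actually requests the news page.

```python
import time


class TTLCache:
    """Cache one fetched result per key and refresh it after `ttl` seconds."""

    def __init__(self, fetch, ttl=300, clock=time.monotonic):
        self._fetch = fetch   # hypothetical callable that performs the real request
        self._ttl = ttl       # 300 s = the 5 minutes suggested above
        self._clock = clock   # injectable clock, which makes the cache easy to test
        self._store = {}      # key -> (time of last fetch, cached value)

    def get(self, key):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]   # still fresh: no request is sent
        value = self._fetch(key)
        self._store[key] = (now, value)
        return value
```

With this in place, repeated lookups for the same query inside the 5-minute window are served from memory, so the site sees at most one request per query per window.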

kmx, 2014-06-11
@kmx

Have you tried parsing from Yandex via RSS and proxies?
That worked two years ago.
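
A rough sketch of the RSS-plus-proxy approach, in Python to match the spider snippet in this thread and using only the standard library. The feed URL is not given here, so none is assumed; `parse_rss` and `fetch_via_proxy` are illustrative helper names, and the proxy address in the docstring is a placeholder.

```python
import urllib.request
import xml.etree.ElementTree as ET


def parse_rss(xml_text):
    """Return a list of (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        # findtext falls back to "" when the child element is missing
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        items.append((title, link))
    return items


def fetch_via_proxy(url, proxy, timeout=10):
    """Download `url` through an HTTP proxy, e.g. 'http://127.0.0.1:8080'."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Usage would look like `parse_rss(fetch_via_proxy(feed_url, proxy))`, rotating `proxy` between requests. Because RSS is plain XML, no regular expressions over HTML are needed.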

Optimus, 2014-06-11
Pyan @marrk2

Has anyone noticed whether Yandex lifts the captcha after a while, once it has been shown and no more requests come from that IP?
