How to parse HTML pages and process them?

Python

catana666, 2014-10-30 20:36:10

There is a VK page at vk.com/go_in_zp?z=photo-50824015_344878304%2Falbum... I need to parse its HTML and extract the image link cs624016.vk.me/v624016533/a226/owG51bJm59o.jpg. Could you show me the code in C++ or Python, if it's not too much trouble?

4 answer(s)

Andrey K, 2014-10-30
@mututunus

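$ pip install lxml cssselect

from urllib.request import urlopen
from lxml import html

# URL of the VK page to parse
url = 'http://vk.com/go_in_zp?z=photo-50824015_344878304%2Falbum-50824015_00%2Frev'
data = urlopen(url).read()
h = html.fromstring(data)
# href of the first link inside the element with class "mv_actions"
print(h.cssselect('.mv_actions a')[0].attrib['href'])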

throughtheether, 2014-10-30
@throughtheether

Here is some ugly, clumsy, but working Python code:

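from selenium import webdriver
from selenium.webdriver.common.by import By
import time

browser = webdriver.Firefox()
url = 'http://vk.com/go_in_zp?z=photo-50824015_344878304%2Falbum-50824015_00%2Frev'
browser.get(url)
time.sleep(5)  # this is bad: a fixed delay instead of waiting for the element
# the photo viewer puts the image inside <a id="pv_photo">
img = browser.find_element(By.XPATH, '//a[@id="pv_photo"]/img')
print(img.get_attribute('src'))
browser.quit()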

Conclusion:
Ways this code could be improved:
1) Replace the time.sleep(5) line with a check for the element: wait a second, look for the element, and if it is not there yet, increment a counter and try again; when the counter reaches its maximum, give up with a timeout.
2) Replace Selenium with PhantomJS, so that no Firefox window appears.
3) Work out what requests the browser makes when it loads the page, and reproduce them with requests.
The third way is, in my opinion, the most labor-intensive but also the most promising in terms of how fast the solution runs.
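
Point 1 could be sketched roughly as the polling helper below. The name wait_for and its parameters are made up for illustration; they are not part of Selenium. It also catches every exception as "not found yet", which is cruder than you would want in real code.

```python
import time


def wait_for(find, max_attempts=10, interval=1.0):
    """Call find() until it returns something truthy or the attempts run out.

    find is any zero-argument callable (hypothetical helper, not a Selenium API).
    """
    attempts = 0
    while attempts < max_attempts:
        try:
            result = find()
        except Exception:
            # Assumption: any exception means "element not there yet"
            # (Selenium raises NoSuchElementException in that case).
            result = None
        if result:
            return result
        attempts += 1          # increase the counter and try again
        time.sleep(interval)   # wait before the next check
    raise TimeoutError('element not found after %d attempts' % max_attempts)
```

With the Selenium code above, find would be a lambda wrapping the element lookup, and the call would replace the time.sleep(5) line. Selenium also ships its own WebDriverWait class in selenium.webdriver.support.ui, which is probably the better choice in practice.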
UPD:
Solution with requests:

import requests
from lxml.html import fromstring

url = 'http://vk.com/go_in_zp?z=photo-50824015_344878304%2Falbum-50824015_00%2Frev'
# the photo id sits between 'photo-' and the first '%2F'
search_string = url[url.find('photo-') + len('photo-'):url.find('%2F')]
r = requests.get(url)
doc = fromstring(r.text)
# find the link whose onclick handler mentions this photo id
xpath = '//a[contains(@onclick, "%s")]/img' % search_string
print(doc.xpath(xpath)[0].attrib['src'])
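
The string slicing that builds search_string is a bit fragile. As a possible alternative, here is a sketch that pulls the same photo id out with the standard urllib.parse module. The function name photo_id is hypothetical, and it assumes the id always comes first in the z parameter, right after the 'photo-' prefix, as it does in this URL.

```python
from urllib.parse import urlparse, parse_qs


def photo_id(url):
    """Return the photo id from the 'z' query parameter of a VK photo URL."""
    # parse_qs decodes %2F to '/', so z becomes 'photo-<id>/album-.../rev'
    z = parse_qs(urlparse(url).query)['z'][0]
    first_part = z.split('/')[0]          # 'photo-<id>'
    return first_part[len('photo-'):]     # drop the 'photo-' prefix


url = 'http://vk.com/go_in_zp?z=photo-50824015_344878304%2Falbum-50824015_00%2Frev'
print(photo_id(url))
```

The result can be used in place of search_string in the XPath expression.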

Trrrrr, 2014-10-30
@Trrrrr

The easiest way is to use QtWebKit: https://qt-project.org/doc/qt-5/qwebframe.html#fin...

lPolar, 2014-10-31
@lPolar

For a beginner, the easiest tool to learn for this is Grab ( https://pypi.python.org/pypi/grab/0.4.13 ).
Open the page in Firefox with Firebug, look at the page source, and find the piece you need. Copy its XPath from Firebug, and then you can do something like this (Python 3):

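from grab import Grab

g = Grab()
sample_url = 'some_url'      # placeholder: the page to fetch
xpath_part = 'some_xpath'    # placeholder: the XPath copied from Firebug
g.go(sample_url)
# assumes a Grab release that exposes g.doc.select(); check the docs for your version
result = g.doc.select(xpath_part).text()
print(result)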