Python
BogBel, 2015-12-19 20:51:43

How to parse raw html pages?

Good day. I need to parse data from a site. I have read a bunch of docs, and they all come down to sending a request and processing the response data.
For example, I found this one:

import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen("http://www.google.com").read()
soup = BeautifulSoup(html, "html.parser")

That all works, but I'm not getting the data I want.
For example, I open the page in the inspector, and this is the data I get now:
1dc14e9d989b4aef96d45daa11e6fcf6.JPG
but what I would like to receive instead is this:
62782b4f6a8c4b71988d406e230d33c3.JPG
That is, the question is how to collect the page content directly, rather than take the data from the response object.
I found a solution for myself in this:
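from mechanize import Browser
from bs4 import BeautifulSoup

mech = Browser()
mech.set_handle_robots(False)  # do not fetch or obey robots.txt
url = 'http://example.com'  # mechanize needs an absolute URL, scheme included
page1 = mech.open(url)
html1 = page1.read()  # response body as returned by the server
soup1 = BeautifulSoup(html1, 'html.parser')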

3 answer(s)
Roman Kitaev, 2015-12-19
@deliro

What the inspector shows is data that has already been processed by the browser; urllib gives you the raw response.
Use Selenium (the browser window can be hidden by using PhantomJS).
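
A minimal sketch of this approach, written for Python 3 and assuming the `selenium` package plus a Firefox installation that Selenium can drive. Selenium later dropped PhantomJS support, so headless Firefox stands in for it here. `make_headless_firefox` and `fetch_rendered_html` are hypothetical helper names, not part of Selenium.

```python
def make_headless_firefox():
    """Start Firefox without a visible window (assumes Selenium and Firefox are installed)."""
    # Imported lazily so this file can be loaded even where Selenium is missing.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")  # run the browser with no window
    return webdriver.Firefox(options=options)


def fetch_rendered_html(url, driver=None):
    """Return the page HTML as the browser sees it after loading the URL.

    `driver` is any object with get(), page_source and quit();
    a headless Firefox is started when none is passed in.
    """
    if driver is None:
        driver = make_headless_firefox()
    try:
        driver.get(url)            # the browser loads the page and runs its scripts
        return driver.page_source  # serialized DOM, not the raw HTTP response body
    finally:
        driver.quit()              # always shut the browser down


if __name__ == "__main__":
    # Placeholder URL, used only for illustration.
    print(fetch_rendered_html("http://example.com")[:200])
```

The returned string can be passed to `BeautifulSoup(html, "html.parser")` just like the raw response in the question. Content that arrives through later AJAX calls may need an explicit wait before `page_source` is read.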

nirvimel, 2015-12-19
@nirvimel

selenium

hdworker, 2015-12-19
@hdworker

For pages generated on the server: pycurl.
For AJAX pages that request information from the server: HtmlUnit.
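
One rough way to tell the two cases apart, as a sketch in Python 3: fetch the raw response with pycurl and check whether the text you see in the browser is already in it. This assumes the `pycurl` package is installed. `fetch_raw_html` and `is_server_rendered` are hypothetical helper names, and the substring check is only a heuristic.

```python
from io import BytesIO


def fetch_raw_html(url):
    """Download the response body with pycurl, without running any JavaScript."""
    # Imported lazily so this file can be loaded even where pycurl is missing.
    import pycurl

    buffer = BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.FOLLOWLOCATION, True)  # follow HTTP redirects
    curl.setopt(pycurl.WRITEDATA, buffer)     # collect the body in memory
    try:
        curl.perform()
    finally:
        curl.close()
    # Assumes UTF-8; undecodable bytes are replaced rather than raising.
    return buffer.getvalue().decode("utf-8", errors="replace")


def is_server_rendered(raw_html, expected_text):
    """True if text visible in the browser is already present in the raw response."""
    return expected_text in raw_html


if __name__ == "__main__":
    # Placeholder URL and text, used only for illustration.
    raw = fetch_raw_html("http://example.com")
    print(is_server_rendered(raw, "Example Domain"))
```

If the check returns False for text that the inspector does show, the page is probably filled in by scripts after loading, and a tool that runs a browser engine is the better fit.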
