How to parse raw HTML pages?

Python

BogBel, 2015-12-19 20:51:43

Good day. I need to parse data from a site. I have read a bunch of docs, and all of them come down to sending a request and processing the response data.
For example, I found this:

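import urllib2  # Python 2 standard library
from bs4 import BeautifulSoup
html = urllib2.urlopen("http://www.google.com").read()  # raw response body
soup = BeautifulSoup(html, "html.parser")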

That works, but I'm not getting the data I want.
For example, when I open the page with the inspector, I get this data:

but instead I would like to receive

In other words, the question is how to get the data not from the response object, but by collecting the page content directly.
I found a solution that works for me:

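from mechanize import Browser
from bs4 import BeautifulSoup
mech = Browser()
mech.set_handle_robots(False)  # do not fetch or obey robots.txt
url = 'http://example.com'  # an absolute URL, including the scheme, is required
page1 = mech.open(url)
html1 = page1.read()  # response body
soup1 = BeautifulSoup(html1, 'html.parser')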

3 answer(s)

Roman Kitaev, 2015-12-19
@deliro

What the inspector shows is the processed data; urllib2 gives you the raw response.
Use Selenium (the browser window can be hidden by using PhantomJS).
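
A minimal sketch of that approach, written for Python 3 and assuming the `selenium` package plus Firefox and geckodriver are installed. PhantomJS is no longer maintained and recent Selenium releases no longer ship a driver for it, so this sketch swaps in headless Firefox. The names `make_headless_firefox`, `fetch_rendered_html` and the `driver_factory` parameter are hypothetical, introduced only for illustration.

```python
def make_headless_firefox():
    """Start Firefox without a visible window (assumes selenium + geckodriver)."""
    # Imported lazily so this module still loads where selenium is missing.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # hide the browser window
    return webdriver.Firefox(options=options)


def fetch_rendered_html(url, driver_factory=make_headless_firefox):
    """Return the page HTML after the browser has run the page's JavaScript."""
    driver = driver_factory()
    try:
        driver.get(url)            # load the page in a real browser engine
        return driver.page_source  # serialized DOM, not the raw response body
    finally:
        driver.quit()              # always shut the browser down


# Usage (starts a real browser):
#   html = fetch_rendered_html("http://example.com")
#   soup = BeautifulSoup(html, "html.parser")
```

If the data arrives through AJAX some time after the page loads, `page_source` may be read too early; waiting for a specific element with Selenium's `WebDriverWait` before reading it is the usual remedy.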

nirvimel, 2015-12-19
@nirvimel

Selenium

hdworker, 2015-12-19
@hdworker

For pages generated on the server: pycurl
For AJAX pages that request information from the server: HtmlUnit
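
For the server-generated case, a rough Python 3 sketch with pycurl could look like the following. It assumes the `pycurl` package is installed and that the page is UTF-8 encoded. The function name `fetch_raw_html` and its `curl` parameter are hypothetical and exist only so that the handle can be replaced when experimenting.

```python
from io import BytesIO


def fetch_raw_html(url, curl=None):
    """Download the response body as the server sent it (no JavaScript is run)."""
    if curl is None:
        import pycurl  # imported lazily; requires the pycurl package
        curl = pycurl.Curl()
    buffer = BytesIO()
    try:
        curl.setopt(curl.URL, url)
        curl.setopt(curl.FOLLOWLOCATION, True)  # follow HTTP redirects
        curl.setopt(curl.WRITEDATA, buffer)     # collect the body in memory
        curl.perform()
    finally:
        curl.close()
    # Assumption: the page is UTF-8; undecodable bytes are replaced.
    return buffer.getvalue().decode("utf-8", errors="replace")


# Usage (performs a real HTTP request):
#   html = fetch_raw_html("http://example.com")
```

The returned string can be passed to BeautifulSoup in the same way as the output of urllib2 or mechanize.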