How to access Ajax content while parsing?

R

Ruslan Ezhgurov2015-10-28 09:47:06

Python

Ruslan Ezhgurov, 2015-10-28 09:47:06

We need to collect data from a block that is loaded using ajax.
get_html(url) standard function returns content without ajax block

def get_html(url):
  response = urllib.request.urlopen(url)
  return response.read()

def parse(html):
  soup = BeautifulSoup(html)
  div = soup.find('div', id='tz-bl-hidden-two')
  print(div)

as a result I get:

Reply

Answer the question

In order to leave comments, you need to log in

2 answer(s)

N

nirvimel, 2015-10-28
@Oyaseo

In general, using html parsing, it is not possible to get the page in the form in which it is brought by its own javascripts when loaded, because the parser is not a browser , it does not execute javascripts.
In special cases, you can remove all the urls that go through ajax requests from the script text, make all these requests in your code and parse the results. There are a lot of pitfalls here - firstly, the ajax request parameters can be hidden in the code in some non-trivial way, secondly, you need to correctly set all request headers with all cookies (which scripts from the page can also manipulate), then do not forget correctly set the referrer. In the general case, the scripts on the page always have the opportunity, using some dynamically changing parameters, to confuse their work so that it will be impossible to create a parser for such a page.
A radically different option is to use a real browser (through Sillentium, for example), which executes all scripts and, from the point of view of the opposite side, is indistinguishable from a live user. This solves all the tricky ajax problems. But this is a completely different order of the amount of resources consumed and speed. If, for example, on the cheapest vps (with 128 MB of memory) on a gigabit channel, you can parse in 50-100 threads. Even at the rate of several seconds for waiting + processing of each page, we get 10-20 scattered pages per second. Now if you switch to Sillentium + Webkit, then 128 MB is no longer enough to run even one thread. Even if you run all this on your home desktop with gigabytes of memory (with vps as a proxy), you can get a maximum of several scattered pages per second.

A

angru, 2015-10-28
@angru

Take the browser, open the developer tool (F12 usually), the Networking tab, put a filter on XHR requests, refresh the page, if you need to click somewhere to execute Ajax - click. Everything that needs to be displayed on the query panel, there is also all the necessary information on the request (headers, parameters, response), study the api and make the same requests from python yourself.