A
A
Alexey2016-04-11 17:19:09
Python
Alexey, 2016-04-11 17:19:09

How to make a parser in python, given that the page transition is carried out in javascript?

Good afternoon!
I just started to disassemble python and don't know much about it.
Faced such a problem using the article gis-lab.info/qa/scrapy.html as an example, I made a parser, but when I started to remake it for my needs, namely:
Make a site parser https://bankrot.fedresurs.ru/ArbitrManagersList.aspx to get all FIO and URL saw that all pagination is done using the ajax request method. How can I make a parser using python? Thanks in advance!

Answer the question

In order to leave comments, you need to log in

2 answer(s)
N
nirvimel, 2016-04-11
@MrSen

In this case, it is enough to set the AmListSearch cookie to PageNumber=N, request a page at the same address and receive in response a list opened immediately from page N.
For example:

$ curl --cookie "AmListSearch=PageNumber=12" https://bankrot.fedresurs.ru/ArbitrManagersList.aspx > bankrot.html
$ firefox bankrot.html

In general, in such cases, one should act according to approximately the following algorithm:
  1. Use FireBug (or the built-in developer panel Tools->Web_Developer->Network) to catch an outgoing HTTP request for an action that causes AJAX content to be loaded.
  2. Determine through which parameter the variable is passed (page number, for example). This can be not only a GET request parameter, but also a POST form field, or a cookie, or even an arbitrary custom HTTP header.
  3. Determine the format and structure of the response. This can be an arbitrary HTML fragment (most often), or an entire HTML document, or XML, or JSON (the most correct option from the point of view of development), or in general an arbitrary text format that is parsed by the script after receiving (this is exactly the crazy format we have in this case, I didn’t even look at it, I immediately tried workarounds and found one).
  4. Write a script that generates queries similar to those that leave the page and parses the responses.

B
bioroot, 2016-04-21
@bioroot

Alternatively, in theory, you can tie PhantomJS as a webdriver for Selenium. Actually, I haven't tried it myself. When there was a task of parsing sites, it was easier for me to transfer directly to PhantomJS / SlimerJS. But there were many sites with a similar structure and it was not possible to figure out with your hands where ajax is and where it is not.

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question