JavaScript
VZVZ, 2016-01-25 00:40:40

Writing bots for AJAX sites with Selenium or PhantomJS. How do I track the changes that JS (or the AJAX requests themselves) makes to the DOM?

I'll start from the beginning.
I began my study of such a promising area as developing bots (automatic clients) for all sorts of sites and services the way many beginners do: with browser engines (WebBrowser, i.e. Internet Explorer, and Awesomium).
The first stumbling blocks were:
1) AJAX, now ubiquitous, which dynamically loads content on many sites; there was no proper way to track those loads, and therefore no way to get at that content
2) uploading files into input type=file fields (you cannot set a file into such a field via JS, short of crude autoclicking, and the engines provided no special facilities for it, although in theory they could have!)

Then I discovered HTTP sniffers (Fiddler is my favorite), and through them raw HTTP requests. (To be fair, I had encountered requests a bit earlier, working with official APIs such as VK's, but that acquaintance was superficial, and it is off topic anyway, since we are talking about sites that have no suitable API at all.)
The low-level nature of this approach gives it its main advantages (speed, and versatility: it works for 99.99% of all sites), but it also creates disadvantages: it is laborious, and it is fragile against any anti-bot measures the admins take.
It is very hard to imitate every header the browser sends, and without that it is easy for the server to tell a bot from a browser.
Sometimes it is hard to figure out which of the sniffed requests actually need to be sent and which do not.
It is also sometimes hard to work out how the site's JS generates certain values (and to reproduce that algorithm on my side).
And so on.

In general, this is acceptable for many cases, so I am not going to abandon this approach. But each task has its own tool.
And there are tasks where stability and development speed matter more than execution speed.
I have no desire to go back to plain browser engines, so at first I was skeptical about Selenium and PhantomJS too.

But I was won over by the fact that PhantomJS, it turns out, provides exactly those "special facilities" for uploading files into input type=file that plain, non-special engines (curse them!) do not!
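For reference, that facility is `page.uploadFile()` from the PhantomJS webpage module. A minimal sketch (the URL, selector, and file path here are placeholders for illustration):

```javascript
// Run with: phantomjs upload.js
var page = require('webpage').create();

page.open('https://example.com/upload', function (status) {
  if (status !== 'success') {
    console.error('page failed to load');
    phantom.exit(1);
    return;
  }
  // uploadFile() attaches a local file to the first element matching the
  // selector -- the very thing plain engines give no special means for.
  page.uploadFile('input[type=file]', '/path/to/local/file.png');
  // ...submit the form here, then exit.
  phantom.exit();
});
```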

And here are the questions:

1) Uploading files - OK.
Can PhantomJS also track and intercept the changes that JS makes to the DOM as a result of AJAX requests?
It seems there is a way to intercept the requests themselves (a code example in the answers still wouldn't hurt).
And what if you need to catch the DOM changes? Do Mutation Events work there, or is there a "special tool" for that too?
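To make the question concrete, here is a sketch of both sides of it, using only documented PhantomJS hooks: `onResourceRequested`/`onResourceReceived` for the HTTP traffic (including XHR), and a `MutationObserver` injected into the page for DOM changes, reporting back through `window.callPhantom()`/`page.onCallback`. Note that `MutationObserver` is available in PhantomJS 2.x; the 1.x builds only had the older, deprecated Mutation Events. The URL is a placeholder.

```javascript
// Run with: phantomjs watch.js
var page = require('webpage').create();

// 1) Intercept HTTP traffic (page loads, scripts, and AJAX alike).
page.onResourceRequested = function (requestData) {
  console.log('->', requestData.method, requestData.url);
};
page.onResourceReceived = function (response) {
  if (response.stage === 'end') {
    console.log('<-', response.status, response.url);
  }
};

// 2) Receive messages sent from page context via window.callPhantom().
page.onCallback = function (data) {
  console.log('DOM changed, mutations:', data.mutations);
};

page.open('https://example.com/', function (status) {
  if (status !== 'success') { phantom.exit(1); return; }
  // Inject a MutationObserver into the page context.
  page.evaluate(function () {
    var observer = new MutationObserver(function (mutations) {
      window.callPhantom({ mutations: mutations.length });
    });
    observer.observe(document.body, { childList: true, subtree: true });
  });
  // Keep running so AJAX-driven changes can arrive; exit after 30 s.
  setTimeout(function () { phantom.exit(); }, 30000);
});
```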

2) How does Selenium do in terms of uploading files, tracking HTTP requests, and DOM changes?


1 answer
Alexey Sundukov, 2016-01-25
@alekciy

1) Why would such a high-level tool need to track something as low-level as DOM changes? All you really need is XPath to pull out the data, plus wait() to wait for data that is computed by JS or fetched via AJAX to appear.
2) As far as I remember, none of that. You can't even get something as basic as the HTTP status code of a response.
You just need to understand that Selenium, and WebDriver in particular, were written for testing, not for writing bots. That they get used for bots as well is just a side effect. So, one way or another, something will have to be finished by hand. With PhantomJS, for example, that means writing additional JS scripts for it (it is also worth looking at CasperJS, which already ships a set of such scripts) to fill in the missing functionality.
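The XPath-plus-wait approach from point 1 can be sketched with the Node selenium-webdriver bindings (a hypothetical example; the URL, XPath, and browser choice are placeholders, and this uses the modern async API rather than the old promise manager):

```javascript
// npm install selenium-webdriver, plus a matching browser driver.
const { Builder, By, until } = require('selenium-webdriver');

(async function main() {
  const driver = await new Builder().forBrowser('firefox').build();
  try {
    await driver.get('https://example.com/');
    // Wait up to 10 s for the element that JS/AJAX is expected to insert,
    // then pull its data out with XPath -- no DOM-change tracking needed.
    const el = await driver.wait(
      until.elementLocated(By.xpath('//div[@id="result"]')),
      10000
    );
    console.log(await el.getText());
  } finally {
    await driver.quit();
  }
})();
```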
