How do I extract JSON from inline JavaScript in an HTML page with Python?
Here is the situation. I got an order to write a scraper for a site. I seem to have traced everything: how, what, and in what sequence the site loads. The data is loaded dynamically via AJAX on scroll. Intercepting that data is not difficult either, but the data itself is a terrible mess. Namely, the request is a POST, and the response is an entire HTML page. This page has as many as 7 inline JavaScript blocks, and they appear in a different order each time (apparently a primitive protection against scraping). One of the scripts contains a contentData variable holding JSON. It has a lot in it, including the data the client actually needs. Another script contains a second variable, also JSON (a dictionary), which holds an ID used to load the next page of data.
My algorithm (as I see it) is as follows:
1. Download the HTML
2. Pull out all the "script" tags with BS4
3. Loop over them to find the script that contains the required JSON variable
4. Pull the first dictionary out of this script
5. Take the data from it and write it to CSV
6. Loop again (possibly in the same loop) to find the script with the second JSON variable
7. Take the data from it, build the link for the next page, and repeat ...
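Steps 2, 3 and 6 could be sketched roughly like this. It is only a minimal sketch: the standard library's `html.parser` stands in for BS4 here so the example has no dependencies, and `ScriptCollector`, `find_script`, the `nextPage` variable name and the sample HTML are all made up for illustration. Because the scripts arrive in a different order each time, the search goes by variable name rather than by position.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text of every inline <script> tag in a page."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


def find_script(html, var_name):
    """Return the text of the first inline script mentioning var_name, or None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for text in collector.scripts:
        if var_name in text:
            return text
    return None


# Hypothetical response body; the real page has 7 scripts in varying order.
sample_html = """<html><body>
<script>var nextPage = {"id": "abc123"};</script>
<script>var contentData = {"items": [{"name": "x"}]};</script>
</body></html>"""

content_script = find_script(sample_html, "contentData")
next_page_script = find_script(sample_html, "nextPage")
```

A plain substring check is the simplest test; if the variable name could also appear inside other scripts, a stricter check on the `var <name> =` declaration would be safer.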
The difficulties in this algorithm are with steps 3-4 and 6-7.
I can find the variable name with a plain substring search or a regular expression over the script text (step 3 or 6), but I have no idea how to pull the JSON itself out of the JS code.
The code looks like this: var contentData = {...[{...[{...}]...}]...}
How can I take this "contentData" out of the JS and load it into Python as JSON?
I hope I explained clearly.
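One possible approach, as a minimal sketch: find the `var contentData =` declaration with a regular expression, then let `json.JSONDecoder.raw_decode` parse starting at the opening brace. It parses exactly one JSON value and ignores whatever follows, so the nested brackets do not have to be matched by hand. This assumes the assigned value is strict JSON (double-quoted keys, no trailing commas, no JS expressions); otherwise `raw_decode` raises `json.JSONDecodeError`. The helper name `extract_js_json` and the sample script are hypothetical.

```python
import json
import re


def extract_js_json(script_text, var_name):
    """Return the JSON value assigned to var_name in a JS source string, or None."""
    # Match "var contentData =" (also let/const) up to where the value starts.
    pattern = re.compile(r"\b(?:var|let|const)\s+" + re.escape(var_name) + r"\s*=\s*")
    match = pattern.search(script_text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and stops
    # there, so the trailing ";" and the rest of the script are ignored.
    value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return value


# Hypothetical script text, made up for illustration.
sample_script = (
    'var other = 1; '
    'var contentData = {"items": [{"id": 7, "tags": [{"name": "a"}]}]}; '
    'doSomething();'
)
content_data = extract_js_json(sample_script, "contentData")
first_id = content_data["items"][0]["id"]
```

The result is an ordinary Python dict, so the fields the client needs can be read from it directly and written to CSV. The same helper would work for the second variable that holds the next-page ID.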