Parsing
Albert, 2020-11-21 19:58:28

Parsing (scraping): obtaining information from sites, authorization, tools, examples?

Good day!

I'm completely confused, help me figure it out or point me in the right direction.

In general, the problem is as follows:
- There are sites that serve information as JavaScript + HTML; some even run React, so the requests from the browser to the server get quite convoluted
- The sites require authorization
- Information is obtained by downloading files, or as static HTML that is loaded via AJAX
- Some captchas are present, both Google reCAPTCHA and plain enter-the-characters images (the images are loaded after the page loads)

As I see the solution: a Spring Boot application runs on the server; users send requests via Telegram, receive the information, and are happy.

I tried doing it with RestTemplate: built the logic, saved cookies, but ran into the problem that JavaScript does not execute, and in the case of React resources I couldn't track all the cookies the resource sets in the browser, so I got access denied. Plain POST/GET requests work perfectly wherever there are HTML forms.

I tried HtmlUnit; everything looks nicely documented: JavaScript support, sane CSS navigation, but there are a lot of errors on the very first request, and the login captcha does not load...

2020-11-21 11:29:58.296  INFO 17496 --- [  restartedMain] com.ssnbuild.ssn.Application             : Started Application in 5.129 seconds (JVM running for 7.907)
2020-11-21 11:30:35.988 ERROR 17496 --- [legram Executor] c.g.h.javascript.StrictErrorReporter     : runtimeError: message=[An invalid or illegal selector was specified (selector: '*,:x' error: Invalid selector: :x).] sourceName=[https://info.is/media/bower_components/jquery/dist/jquery.min.js] line=[2] lineSource=[null] lineOffset=[0]
2020-11-21 11:30:36.701  INFO 17496 --- [legram Executor] c.g.h.javascript.JavaScriptEngine        : Caught script exception

com.gargoylesoftware.htmlunit.ScriptException: URIError: Malformed URI sequence. (https://info.is/media/dist/js/main.js#1)
  at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:921) ~[htmlunit-2.23.jar:2.23]

......

2020-11-21 11:30:37.059 ERROR 17496 --- [legram Executor] c.g.h.javascript.StrictErrorReporter     : error: message=[missing ) after formal parameters] sourceName=[https://info.is/media/js/login.js] line=[1] lineSource=[var _0x295d=['\x77. <lots of obfuscated JS gibberish here, >10k characters, omitted> \x35')](doRestore);}});continue;case'\x35':_0x1a7d78[_0x4a84('0x196','\x73\x52\x4a\x4b')](renewCaptcha);continue;}break;}});] lineOffset=[62091]
2020-11-21 11:30:37.065  INFO 17496 --- [legram Executor] c.g.h.javascript.JavaScriptEngine        : Caught script exception

com.gargoylesoftware.htmlunit.ScriptException: missing ) after formal parameters (https://info.is/media/js/login.js#1)
  at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:921) ~[htmlunit-2.23.jar:2.23]
  at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628) ~[htmlunit-core-js-2.23.jar:na]
  at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:515) ~[htmlunit-core-js-2.23.jar:na]

....


2020-11-21 11:30:37.075  INFO 17496 --- [legram Executor] c.g.h.javascript.JavaScriptEngine        : Caught script exception

com.gargoylesoftware.htmlunit.ScriptException: URIError: Malformed URI sequence. (https://info.is/media/dist/js/main.js#1)
  at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:921) ~[htmlunit-2.23.jar:2.23]
  at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628) ~[htmlunit-core-js-2.23.jar:na]


I looked in the direction of Selenium, but that involves heavy dependencies, a real browser is needed, etc.

Therefore, I decided to ask you... Maybe someone has used something to solve these kinds of problems, or can point me to a write-up readable even for a blockhead, so it becomes clear how to work with the component.

Ideally, I would like a virtual (headless) browser that can fill out forms by accessing DOM elements, and that, without my involvement, executes the JavaScript the site's developers provide.

Help however you can... please :)

3 answer(s)
Orkhan Hasanli, 2020-11-21
@azerphoenix

Hello!
Let's start with the simple case, when the site's content is served without frameworks. Accordingly, there is no AJAX, no need to scroll anywhere or click buttons to get the next page of material, etc. I.e. you just need to send a GET request to the site and get the data.
In this case, the jsoup library will be enough for parsing. Or use a DOM/SAX parser for your own custom implementation.
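To illustrate the custom-implementation route: a minimal sketch using the JDK's built-in DOM parser. Note this only works on well-formed XHTML; for messy real-world HTML, jsoup is far more forgiving. The markup and element choice here are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DomScrapeSketch {
    // Extracts the text of every <a> element from a well-formed XHTML snippet.
    static List<String> linkTexts(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            NodeList anchors = doc.getElementsByTagName("a");
            List<String> out = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                out.add(((Element) anchors.item(i)).getTextContent());
            }
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"/a\">First</a><a href=\"/b\">Second</a></body></html>";
        System.out.println(linkTexts(page)); // prints [First, Second]
    }
}
```

With jsoup the same extraction would be a one-liner (`Jsoup.parse(page).select("a")`), which is why it is the usual choice for static pages.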
Now let's complicate the task a little: the site is built the same way, without frameworks, but authorization is needed to access the information. If simple authorization is used, it is enough to get the cookies once and send them to the server with each request. Also don't forget about the Referer and User-Agent headers.
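The get-cookies-once, send-them-every-time approach can be sketched with the JDK 11+ HttpClient, whose CookieManager persists cookies across requests automatically (the URL and cookie values below are hypothetical):

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpCookie;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;

public class SessionSketch {
    public static void main(String[] args) {
        // CookieManager stores cookies from Set-Cookie response headers and
        // attaches them to every subsequent request automatically.
        CookieManager cookies = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        HttpClient client = HttpClient.newBuilder().cookieHandler(cookies).build();

        // After a successful login POST the session cookie would live in the
        // store; here it is added manually to keep the sketch offline.
        cookies.getCookieStore().add(URI.create("https://example.com/"),
                new HttpCookie("SESSIONID", "abc123"));

        // Each request should also carry a plausible User-Agent and Referer,
        // since many sites reject the default Java client signature.
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/data"))
                .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .header("Referer", "https://example.com/login")
                .GET()
                .build();

        System.out.println(cookies.getCookieStore().get(URI.create("https://example.com/")).size()); // prints 1
        // client.send(request, ...) would now include: Cookie: SESSIONID=abc123
    }
}
```

Spring's RestTemplate can be wired up the same way by giving its request factory an HttpClient configured with a cookie store.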
Now let's complicate the task even more: the content is generated dynamically (by JS frameworks, AJAX requests, etc.). In this case jsoup won't help, because you need to click a "Load more" button, or scroll down to trigger content loading, etc. I.e. you need some interactivity. For this you should look towards Selenium + any browser (Firefox, Chromium, etc.). For speed, it is desirable to run the browser headless.
Let's complicate the task further: you have to log in and solve a captcha, in particular reCAPTCHA. I'll say up front that I once spent a long time looking for workarounds myself, and the simplest solution is to use a paid service.
Link to the site - https://anti-captcha.com/
After entering the username and password, Selenium clicks on the captcha, then we send the data to the service and get the captcha solution back.
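As a rough illustration of that flow, here is how the first request body for such a service might be assembled. The field names follow my reading of anti-captcha.com's public createTask API (a `NoCaptchaTaskProxyless` task for reCAPTCHA); treat them as an assumption and verify against the current documentation before relying on them:

```java
public class AntiCaptchaPayloadSketch {
    // Builds a createTask request body for a reCAPTCHA-solving service.
    // Field names are assumed from anti-captcha.com's docs and may have changed.
    static String createTaskJson(String clientKey, String siteUrl, String siteKey) {
        return "{\"clientKey\":\"" + clientKey + "\","
             + "\"task\":{\"type\":\"NoCaptchaTaskProxyless\","
             + "\"websiteURL\":\"" + siteUrl + "\","
             + "\"websiteKey\":\"" + siteKey + "\"}}";
    }

    public static void main(String[] args) {
        // Hypothetical key and URL, for illustration only.
        String body = createTaskJson("API_KEY", "https://info.is/login", "SITE_KEY");
        System.out.println(body);
        // This body would be POSTed to the service's createTask endpoint; you
        // then poll getTaskResult until the g-recaptcha-response token arrives
        // and inject that token into the login form via Selenium.
    }
}
```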
Let's complicate the task even more: various honeypots. Here, as they say, it varies. It all depends on the specific site and the specific honeypot implementation. Some may block you by IP if a request was made to a non-existent URL: for example, the site has only 100 pages, you requested page 101 and fell into the trap. Or, say, you filled in an invisible input field that a normal user never sees and therefore never fills out.
Going further: if you need interactivity of another kind (i.e. the site's user should be able to run the parsing themselves), then you need a client side written in JavaScript. There are online services for this: google "web scraping online" and you will see various offerings. As a rule, they offer to install some extension which, when activated, gets access to the DOM, and then you use selectors (id, XPath, class) to define what needs to be parsed and what kind of navigation/pagination is used (for example, numbered pages or a "Next" button). There can be pitfalls here: some sites return an error (404) when the maximum page is exceeded, others return no error and simply show the same content again. Sometimes you have to check the page for emptiness (the absence of elements matching the selector), sometimes for a 404 error, etc. In general, this is front-end work.
Some dynamically generated pages load their content as JSON or XML. Accordingly, for some sites you can get by without Selenium at all: just request the materials from their internal API and then parse them with Gson or Jackson.
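In real code you would map such a response onto a POJO with Jackson or Gson. As a dependency-free toy, here is a crude regex extraction of a single string field from a JSON payload (the payload shape is invented for the example; don't use regex for real JSON parsing):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InternalApiSketch {
    // Toy extraction of one string-valued field from a JSON payload.
    // Real code should deserialize with Jackson/Gson instead of regex.
    static String stringField(String json, String field) {
        Matcher m = Pattern
                .compile("\"" + Pattern.quote(field) + "\"\\s*:\\s*\"([^\"]*)\"")
                .matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // A response like this often backs the "dynamic" page content:
        String response = "{\"id\":42,\"title\":\"Hello\",\"body\":\"World\"}";
        System.out.println(stringField(response, "title")); // prints Hello
    }
}
```

Finding the internal API endpoint is usually just a matter of watching the XHR tab in the browser's developer tools while the page loads.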
One of the universal scraping tools I have dealt with is the Visual Web Ripper program. It costs about 250-300 dollars. The program loads the site's content internally through IE (perhaps they have updated that by now), and then you set the parsing conditions and export the data.

Nadim Zakirov, 2020-11-21
@zkrvndm

There are browser extensions that let you run arbitrary JavaScript on websites; just use them. Your task then comes down to sketching a script that fills in the fields and clicks the buttons, after which the resulting userscript is run through one of those extensions on the target site.
I'll even go further: if you wish, you can do without extensions entirely. Just open the browser console, paste in any JavaScript and run it, including code that parses something.

Inviz Custos, 2020-11-21
@MvcBox

1) https://github.com/puppeteer/puppeteer
2) https://2captcha.com/ru
