How can I optimize parsing of HTML documents with Jsoup in Java?

Java

Stanislav, 2016-11-20 12:13:34

Hello. I need to analyze a large number of HTML documents and decided to use the popular Jsoup library (please correct me if this is already a mistake and there is a better library).
The problem is this:
The HTML files are not in quite their normal form. At the very beginning of each file, before the markup, there is the URL the file was downloaded from, followed by a delimiter (|||), so a file looks something like this:

http://www.13abc.com/weather |||
<!DOCTYPE html>
<html>
  <head></head>
  <body></body>
</html>

When a file in this form is passed to Jsoup, it is parsed incorrectly. For one such file I get this result:

Document doc = Jsoup.parse(new File("C:\\test\\27.html"), "UTF-8");
System.out.println(doc.html());

<html lang="en">
 <head></head>
 <body>
  https://stascarz.com/firstPage/buyCar |||    
  <meta charset="UTF-8"> 
  <title>Buy buy cheap car in Toronto</title> 
  <meta name="viewport"> 
  <meta name="description" content="if you want to buy cheap and good car, please call to us"> 
  <meta name="keywords" content="cheap, buy, car, in , toronto"> 
  <meta property="og:type" content="video.movie"> 
  <link rel="canonical" href="">   
  <h1>h11 car buy</h1> 
  <h1>h12 car</h1> 
  <h2>h22</h2> 
  <h3>h33 buy cheap car</h3> 
  <strong>strong1 buy</strong> 
  <strong>strong2 car</strong> 
  <strong>strong3 cheap car</strong> 
  <b>b1 car</b> 
  <script src="https://www.google-analytics.com/analytics.js"></script> 
  <div itemscope="schema.org"></div> 
  <a href="twitter.com"></a> 
  <a href="stascarz.com/1" rel="dofollow"></a> 
  <a href="stascarz.com/1" rel="nofollow"></a> 
  <a href="stascarz.com/1"></a> 
  <a href="stascarz.com/1"></a> 
  <a href="another.com/1" rel="nofollow"></a> 
  <a href="another.com/1" rel="dofollow"></a> 
  <a href="another.com/1"></a>  
 </body>
</html>

That is, Jsoup places the URL from the beginning of the file, together with the contents of head, inside body.
If I remove the URL and the delimiter from the beginning of the file, the result is, unsurprisingly, correct.
I found this workaround:

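Document doc = Jsoup.parse(new File("C:\\test\\27.html"), "UTF-8");
// Serialize the first parse once, drop everything up to and including "|||", then parse again
String html = doc.html();
Document newDoc = Jsoup.parse(html.substring(html.indexOf("|||") + 3));
System.out.println(newDoc.html());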

Cutting off the URL this way also produces correct output, but every file has to be parsed by Jsoup twice. There are a lot of files, so every extra operation matters.
I would be glad to hear any useful suggestions, and I am prepared to learn that I am missing some elementary solution.
Thank you for your attention!

1 answer(s)

al_gon, 2016-11-20
@stanislav_studzinskiy

You definitely need to get rid of the prefix (URL + delimiter (|||)) before parsing.
Otherwise, jsoup will turn it into this:

<html>
 <head></head>
 <body>
  http://www.13abc.com/weather |||     
 </body>
</html>
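
One way to drop the prefix without parsing twice is to read the file as text, cut everything up to the delimiter, and only then hand the rest to jsoup. Below is a minimal sketch of that idea, not a tested recipe. The class name PrefixedHtml and its methods are hypothetical, and the sketch assumes the files are UTF-8 and that a real prefix never contains "<". The helper itself uses only the standard library; the jsoup call is shown in the trailing comment.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: separates the "URL |||" prefix from the HTML that follows it. */
final class PrefixedHtml {
    static final String DELIMITER = "|||";

    final String url;   // source URL taken from the prefix, or "" if no prefix was found
    final String html;  // the markup after the delimiter

    private PrefixedHtml(String url, String html) {
        this.url = url;
        this.html = html;
    }

    /** Splits raw file content at the first delimiter that appears before any markup. */
    static PrefixedHtml split(String raw) {
        int delimiterAt = raw.indexOf(DELIMITER);
        int firstTagAt = raw.indexOf('<');
        // Assumption: a real prefix comes before the first '<'. If the delimiter is
        // missing or only occurs inside the markup, the content is returned unchanged.
        if (delimiterAt < 0 || (firstTagAt >= 0 && firstTagAt < delimiterAt)) {
            return new PrefixedHtml("", raw);
        }
        String url = raw.substring(0, delimiterAt).trim();
        String html = raw.substring(delimiterAt + DELIMITER.length()).trim();
        return new PrefixedHtml(url, html);
    }

    /** Reads a file as UTF-8 text (an assumption about the encoding) and splits it. */
    static PrefixedHtml read(Path file) throws IOException {
        String raw = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return split(raw);
    }
}

// Usage with jsoup (needs the jsoup dependency on the classpath):
//   PrefixedHtml p = PrefixedHtml.read(Paths.get("C:\\test\\27.html"));
//   Document doc = Jsoup.parse(p.html, p.url);  // single parse; the URL serves as the base URI
```

Passing the extracted URL as the base URI is optional; Jsoup.parse(p.html) works as well if relative links do not need to be resolved.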

P.S. Other than that, jsoup is a good choice.