How do I optimize parsing HTML documents with Jsoup in Java?
Hello. I need to process a large number of HTML documents and decided to use the popular Jsoup library (please correct me if that is already a mistake and there is a better library).
The problem is this:
The HTML files are not quite in their normal form: at the very beginning of each file, before the markup, there is the URL the file was downloaded from, followed by a delimiter (|||). A file looks something like this:
http://www.13abc.com/weather |||
<!DOCTYPE html>
<html>
<head></head>
<body></body>
</html>
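I parse such a file like this:
Document doc = Jsoup.parse(new File("C:\\test\\27.html"), "UTF-8");
System.out.println(doc.html());
and get the following output, where the URL line and everything from <head> end up inside <body>: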
<html lang="en">
<head></head>
<body>
https://stascarz.com/firstPage/buyCar |||
<meta charset="UTF-8">
<title>Buy buy cheap car in Toronto</title>
<meta name="viewport">
<meta name="description" content="if you want to buy cheap and good car, please call to us">
<meta name="keywords" content="cheap, buy, car, in , toronto">
<meta property="og:type" content="video.movie">
<link rel="canonical" href="">
<h1>h11 car buy</h1>
<h1>h12 car</h1>
<h2>h22</h2>
<h3>h33 buy cheap car</h3>
<strong>strong1 buy</strong>
<strong>strong2 car</strong>
<strong>strong3 cheap car</strong>
<b>b1 car</b>
<script src="https://www.google-analytics.com/analytics.js"></script>
<div itemscope="schema.org"></div>
<a href="twitter.com"></a>
<a href="stascarz.com/1" rel="dofollow"></a>
<a href="stascarz.com/1" rel="nofollow"></a>
<a href="stascarz.com/1"></a>
<a href="stascarz.com/1"></a>
<a href="another.com/1" rel="nofollow"></a>
<a href="another.com/1" rel="dofollow"></a>
<a href="another.com/1"></a>
</body>
</html>
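My current workaround is to parse the file, cut off everything up to and including the delimiter, and parse the result a second time:
doc = Jsoup.parse(new File("C:\\test\\27.html"), "UTF-8");
String html = doc.html();
// Skip everything up to and including the "|||" delimiter, then parse again.
Document newDoc = Jsoup.parse(html.substring(html.indexOf("|||") + 3));
System.out.println(newDoc.html());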
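One way to avoid the double parse might be to read the file as plain text, split it at the first ||| and hand only the markup to Jsoup. The following is a minimal sketch under assumptions, not a definitive implementation: the names PrefixedHtml, split and read are hypothetical, every file is assumed to be UTF-8, and the first ||| in the file is assumed to be the delimiter. It uses only the JDK, so it runs without Jsoup on the classpath.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: separates the "URL |||" prefix from the HTML that follows it.
final class PrefixedHtml {
    static final String DELIMITER = "|||";

    final String sourceUrl; // text before the delimiter, trimmed ("" if no delimiter)
    final String html;      // text after the delimiter, leading whitespace removed

    PrefixedHtml(String sourceUrl, String html) {
        this.sourceUrl = sourceUrl;
        this.html = html;
    }

    // Splits raw file content at the first occurrence of the delimiter.
    static PrefixedHtml split(String raw) {
        int cut = raw.indexOf(DELIMITER);
        if (cut < 0) {
            // No prefix found: treat the whole content as HTML.
            return new PrefixedHtml("", raw);
        }
        String url = raw.substring(0, cut).trim();
        String markup = raw.substring(cut + DELIMITER.length()).replaceFirst("^\\s+", "");
        return new PrefixedHtml(url, markup);
    }

    // Reads a file as UTF-8 (an assumption) and splits it.
    static PrefixedHtml read(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return split(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

The result could then be parsed once, for example with Jsoup.parse(p.html, p.sourceUrl) where p is the value returned by PrefixedHtml.read. That overload takes the HTML string and a base URI, so relative links can be resolved against the address the file was downloaded from, and the prefix never reaches the parser.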