How to speed up JSOUP for web scraping?


css

reus, 2017-03-24 16:09:37

I decided to try parsing a page with jsoup. Here is the code:

package org.my.parse;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Example program that fetches a page and prints the text matched by a CSS selector.
 */
public class Parse {
  public static void main(String[] args) throws IOException {
    String url = "https://www.olx.ua/obyavlenie/novaya-kvartira-remont-2016-goda-IDn1Xw4.html#dddea08ac8;promoted";
    System.out.printf("Fetching %s...\n", url);

    Document doc = Jsoup.connect(url).get();
    String newSelector = "#offerdescription > div.clr.descriptioncontent.marginbott20 > table > tbody > tr:nth-child(2) > td:nth-child(2) > table > tbody > tr > td > strong";
    Elements links = doc.select(newSelector);
    System.out.println(links.text());
    System.out.println("End");
  }
}

I run the parser in Eclipse and... it takes a really long time. I use CSS selectors because I need to extract a lot of elements from the page, and getting a selector is very simple: Chrome -> Copy -> Copy selector. Is it worth trying XPath instead? (To use XPath with jsoup you also need to add Xsoup.)


2 answer(s)


al_gon, 2017-03-24
@reus

Switching to XPath won't really change anything: either way you are working with the DOM.
I see two options:

don't use Jsoup as the HTTP library; take Apache HttpClient or Google HTTP Client instead
or you are not showing us something. Both the page and the CSS path itself are nothing special. I didn't see any pitfalls.
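
To check whether the time goes into the download rather than the parsing, the two steps can be timed separately. The sketch below is only an illustration under a few assumptions: it uses the JDK's built-in java.net.http.HttpClient (Java 11+) in place of the Apache or Google client mentioned above, and the class name FetchTiming, the helpers time and download, the User-Agent value and the default URL are all made up for the example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper class, not part of jsoup or of the original question.
class FetchTiming {

    /** Value produced by a step together with the time the step took. */
    static final class Timed<T> {
        final T value;
        final long millis;

        Timed(T value, long millis) {
            this.value = value;
            this.millis = millis;
        }
    }

    /** Runs one step and measures its wall-clock duration in milliseconds. */
    static <T> Timed<T> time(Callable<T> step) throws Exception {
        long start = System.nanoTime();
        T value = step.call();
        return new Timed<>(value, (System.nanoTime() - start) / 1_000_000);
    }

    /** Downloads the page body as a string using the JDK HTTP client. */
    static String download(HttpClient client, String url)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .header("User-Agent", "Mozilla/5.0") // assumed value, adjust as needed
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL; pass the real page address as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";

        // One client instance can be reused for many pages.
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        Timed<String> page = time(() -> download(client, url));
        System.out.printf("download: %d ms, %d chars%n",
                page.millis, page.value.length());

        // With jsoup on the classpath, parsing and selecting can be timed the same way:
        //   Timed<Document> doc = time(() -> Jsoup.parse(page.value, url));
        //   Timed<String> text = time(() -> doc.value.select(newSelector).text());
    }
}
```

If the download accounts for almost all of the measured time, changing the selector or switching to XPath is unlikely to make a noticeable difference.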


f9k56, 2017-03-26
@f9k56

I ran into the same thing. Try writing the selector differently.