Java
Vetal Matitskiy, 2015-05-28 08:50:35

How to programmatically crawl through Wiktionary (Wiki-dictionary) pages?

Good afternoon, dear development gurus,
I'm trying to get a list of the words I need from en.wiktionary.org using the Jaunt engine (http://jaunt-api.com).
I wrote a first version of the application. It gets through some of the pages and then fails with an error.
The main algorithm is trivially simple: I open the start page, extract the link to the next page from it, go to that page, and so on.
I don't know why, but the extracted link gets longer on every iteration, and apparently at some step it becomes so long that it can no longer be processed, although when I click through the pages manually the links have a normal length.
Is it possible to prevent the links from swelling?
Examples of the links that get extracted:

next page: en.wiktionary.org/w/index.php?title=Category:Engli...
next page: en.wiktionary.org/w/index.php?title=Category:Engli...
next page: en.wiktionary.org/w/index.php?title=Category:English...

The code itself looks like this:
import com.jaunt.*;

public class Wiksurfer {

    public static void surfPages() {
        int i = 0;
        UserAgent userAgent = new UserAgent();
        userAgent.settings.autoSaveAsHTML = true; // automatically save the last visited page
        // System.out.println("SETTINGS:\n" + userAgent.settings);
        try {

            String href = "http://en.wiktionary.org/wiki/Category:English_uncountable_nouns";

            for (i = 0; i < 20; i++) {
                userAgent.visit(href);
                href = userAgent.doc.findFirst("<a title>next page").getAt("href");

                System.out.println("next page:" + href);
            }
        } catch (JauntException e) {
            System.err.println(e);
        } finally {
            System.out.println("final i = " + i); // number of pages visited before stopping
        }
    }

    public static void main(String[] args) {
        surfPages();
    }

}
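
One guess at the cause, which is an assumption and not something confirmed in this thread: if the href comes back with HTML entities still in it (`&amp;` instead of `&`), then every round trip through the server could escape it once more (`&amp;amp;`, and so on), and that would make the link grow on each step. A minimal sketch of cleaning the link before the next `visit` call is below. `HrefCleaner` and `unescapeAmp` are hypothetical names made up for illustration; they are not part of Jaunt, and the sample link in `main` is invented.

```java
// Hypothetical helper, not part of Jaunt: cleans an href taken from raw HTML.
class HrefCleaner {

    // Replaces "&amp;" with "&" repeatedly until nothing changes, so that
    // a doubly escaped "&amp;amp;" also collapses to a single "&".
    static String unescapeAmp(String href) {
        String previous;
        String current = href;
        do {
            previous = current;
            current = current.replace("&amp;", "&");
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        // Invented sample link with one level of double escaping.
        String raw = "http://en.wiktionary.org/w/index.php?title=Category:Example&amp;amp;from=B";
        System.out.println(unescapeAmp(raw));
    }
}
```

In the loop above this would be used as `href = HrefCleaner.unescapeAmp(href);` right after the `getAt("href")` line. If the printed links contain no `&amp;` at all, this guess is wrong and the cause is something else.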
