How to programmatically crawl through Wiktionary (wiki dictionary) pages?

Vetal Matitskiy, 2015-05-28 08:50:35

Java

Good afternoon, dear development gurus,
I'm trying to get a list of the words I need from en.wiktionary.org using the Jaunt engine (http://jaunt-api.com).
I wrote a first version of the application; it gets through some of the pages and then fails with an error.
The main algorithm is trivially simple: I visit the start page, extract the link to the next page from it, visit that page, and so on.
I don't know why, but the extracted link gets longer on each iteration, and apparently at some point the links become so long that they can no longer be processed, even though the links have a normal length when I step through the pages manually.
Is it possible to prevent the links from swelling?
Examples of the links that get extracted:

next page: en.wiktionary.org/w/index.php?title=Category:Engli...
next page: en.wiktionary.org/w/index.php?title=Category:Engli...
next page: en.wiktionary.org/w/index.php?title=Category:English...

The code itself looks like this:

import com.jaunt.*;

public class Wiksurfer {

    public static void surfPages() {
        int i = 0;
        UserAgent userAgent = new UserAgent();
        userAgent.settings.autoSaveAsHTML = true;  // automatically save the last visited page as HTML
        //System.out.println("SETTINGS:\n" + userAgent.settings);
        try {

            String href = "http://en.wiktionary.org/wiki/Category:English_uncountable_nouns";

            for (i = 0; i < 20; i++) {
                userAgent.visit(href);
                href = userAgent.doc.findFirst("<a title>next page").getAt("href");

                System.out.println("next page:" + href);
            }
        } catch (JauntException e) {
            System.err.println(e);
        } finally {
            System.out.println("final i = " + i);
        }
    }

    public static void main(String[] args) {
        surfPages();
    }

}
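
One possible explanation, which is only an assumption here and not something confirmed by the post or by the Jaunt documentation: the extracted href may still contain `&` written as the HTML entity `&amp;`, and each round trip may escape it again (`&amp;amp;`, and so on), which would make the link grow on every step. A minimal sketch of a normalizing helper under that assumption follows. `HrefNormalizer` and `unescapeAmpersands` are hypothetical names, not part of Jaunt, and the sample URL with its `pagefrom=EXAMPLE` value is made up for illustration.

```java
// Hypothetical helper, not part of Jaunt: undoes repeated HTML escaping of '&'.
// Assumption: the link swelling is caused by "&" being turned into "&amp;" again and again.
final class HrefNormalizer {

    // Collapses "&amp;", "&amp;amp;", ... back to a plain "&".
    static String unescapeAmpersands(String href) {
        String previous;
        String current = href;
        do {
            previous = current;
            // Each pass removes one level of escaping.
            current = current.replace("&amp;", "&");
        } while (!current.equals(previous)); // stop once nothing changes any more
        return current;
    }

    public static void main(String[] args) {
        // Illustrative, made-up link with two levels of escaping.
        String swollen = "http://en.wiktionary.org/w/index.php"
                + "?title=Category:English_uncountable_nouns&amp;amp;pagefrom=EXAMPLE";
        System.out.println(unescapeAmpersands(swollen));
    }
}
```

If the assumption holds, calling `href = HrefNormalizer.unescapeAmpersands(href);` right after the link is extracted in the loop should keep its length stable. Printing `href.length()` on each iteration would show whether it does.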