How do I programmatically crawl Wiktionary (wiki dictionary) pages?
Good afternoon, dear development gurus,
I'm trying to get a list of the words I need from en.wiktionary.org using the Jaunt library (http://jaunt-api.com).
I wrote a first version of the application. It gets through some of the pages and then fails with an error.
The main algorithm is trivially simple: I visit the start page, extract the link to the next page from it, visit that page, and so on.
I don't know why, but the extracted link gets longer on each iteration, and apparently at some point the links become so long that they can no longer be processed. When I step through the pages manually, though, the links have a normal length.
Is there a way to stop the links from swelling?
Examples of the extracted links:
next page: en.wiktionary.org/w/index.php?title=Category:Engli...
next page: en.wiktionary.org/w/index.php?title=Category:Engli...
next page: en.wiktionary.org/w/index.php?title=Category:English...
import com.jaunt.*;

public class Wiksurfer {

    public static void surfPages() {
        int i = 0;
        UserAgent userAgent = new UserAgent();
        userAgent.settings.autoSaveAsHTML = true; // autosave the last visited page
        //System.out.println("SETTINGS:\n" + userAgent.settings);
        try {
            String href = "http://en.wiktionary.org/wiki/Category:English_uncountable_nouns";
            for (i = 0; i < 20; i++) {
                userAgent.visit(href);
                // take the href of the first "next page" link and follow it on the next iteration
                href = userAgent.doc.findFirst("<a title>next page").getAt("href");
                System.out.println("next page:" + href);
            }
        } catch (JauntException e) {
            System.err.println(e);
        } finally {
            // how many pages were processed before the loop ended or failed
            System.out.println("final i" + i);
        }
    }

    public static void main(String[] args) {
        surfPages();
    }
}
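One thing that may be worth checking (a guess, since the printed links above are cut off): the href might come back with its ampersands still HTML-escaped as `&amp;`. If that escaped string is passed to visit() again, it could get escaped once more on every iteration, which would match the swelling described. The sketch below is a minimal, hypothetical helper, not part of Jaunt; the `HrefCleaner` class, the `decodeAmpersands` method and the sample URL are all made up for illustration.

```java
// Hypothetical helper (not part of Jaunt): undoes repeated HTML escaping of '&' in a URL.
final class HrefCleaner {

    private HrefCleaner() {
    }

    // Replaces "&amp;" with "&" until the string stops changing,
    // so "&amp;amp;amp;" collapses to a single "&".
    static String decodeAmpersands(String href) {
        String previous;
        String current = href;
        do {
            previous = current;
            current = previous.replace("&amp;", "&");
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        // Invented sample link with a doubly escaped ampersand (assumption, not real output).
        String swollen = "http://en.wiktionary.org/w/index.php?title=Category:Example&amp;amp;from=A";
        System.out.println(decodeAmpersands(swollen));
    }
}
```

If this turns out to be the cause, the extracted link could be passed through the helper inside the loop before the next visit(href) call. If the full links show no repeated `&amp;`, the cause is something else and this helper will not help.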