Java
Whatevermarever, 2019-06-09 13:40:39

Need help parsing a WordPress site?

There is a site from which I need to parse the photo and title of each post, from the first page to the last. What frameworks would I need? Is it possible to get by with just jsoup? Are there any resources that describe an approximate algorithm for going through the articles and pages?


1 answer(s)
Orkhan, 2019-06-09
@Whatevermarever

Hello!
1) Do you need to be logged in to access the content? If so, read up on how to log in to a site using jsoup.
2) It doesn't matter which CMS the site runs on, WP or anything else.
3) jsoup can't work with dynamic content (for example, AJAX pagination, loading on scroll, etc.). If the site has no dynamic content, jsoup alone is usually enough.
4) If there is dynamic content, look towards Selenium plus a browser (Firefox, Chrome, etc.).
5) "Are there any resources where you can find an approximate algorithm for going through articles and pages?"
There are plenty of resources, just search. The general principle of going through articles and pages is really just loops.
6) It is also possible to parse the data without writing code in a programming language, for example with the Visual Web Ripper program.
Sample parsing plan:
- Determine the type of content (see points 3 and 4).
- Determine whether authorization is needed, and implement it if so.
- Determine the entry point, for example a WP category page.
- Determine the type of pagination. In WP it is usually /page/1, /page/2, /page/3, and so on. How you walk it depends on your goal: you can increment the page number up to the maximum (check which page is the last one), or keep incrementing until a page no longer contains the characteristic post blocks (it all depends on the layout).
- Then, in a loop (do {} while () or while () {}), collect the links to the existing posts and add them to a List.
- After that, loop over the list, open each URL, and parse the content of the page itself. You can also plug in Apache POI to export the data to xlsx after parsing.
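The pagination part of the plan could look roughly like the sketch below. It uses only the standard library so it runs as-is: the `linksOnPage` function is a hypothetical stand-in for the real fetch, which with jsoup would typically be `Jsoup.connect(pageUrl).get()` followed by `select(...)` with a CSS selector that matches the post links in your theme. The class name, the example.com URLs and the `maxPages` safety cap are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class ListingCrawler {

    /**
     * Walks baseUrl/page/1/, baseUrl/page/2/, ... and collects post links
     * until a page yields no links or maxPages is reached.
     * linksOnPage is a placeholder for the real page fetch (e.g. via jsoup).
     */
    static List<String> collectPostLinks(String baseUrl,
                                         Function<String, List<String>> linksOnPage,
                                         int maxPages) {
        // LinkedHashSet keeps discovery order and drops duplicate links.
        Set<String> seen = new LinkedHashSet<>();
        for (int page = 1; page <= maxPages; page++) {
            String pageUrl = baseUrl + "/page/" + page + "/";
            List<String> links = linksOnPage.apply(pageUrl);
            if (links.isEmpty()) {
                break; // no post blocks on this page: we are past the last page
            }
            seen.addAll(links);
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Fake two-page site standing in for real HTTP requests.
        String base = "https://example.com/category/news";
        Map<String, List<String>> fakeSite = Map.of(
            base + "/page/1/", List.of("https://example.com/post-a/", "https://example.com/post-b/"),
            base + "/page/2/", List.of("https://example.com/post-c/")
        );
        List<String> links = collectPostLinks(
            base, url -> fakeSite.getOrDefault(url, List.of()), 100);
        links.forEach(System.out::println);
    }
}
```

With a real fetcher, the page after the last one usually comes back as an HTTP error rather than an empty listing, so the fetch function would need to catch that and return an empty list.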
For convenience, I usually create an object (title, text, image link, etc.), add all the objects to a List, and then export that list to xls.
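A minimal sketch of that object-plus-List idea is below. The `Post` class and its fields are hypothetical names, and the export step writes CSV lines with the standard library purely as a stand-in so the example runs without dependencies; it is not the Apache POI xlsx export mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

class PostExport {

    /** One parsed post; field names are illustrative, not from any library. */
    static final class Post {
        final String title;
        final String imageUrl;
        final String link;

        Post(String title, String imageUrl, String link) {
            this.title = title;
            this.imageUrl = imageUrl;
            this.link = link;
        }
    }

    /** Wraps a value in quotes and doubles embedded quotes, per common CSV practice. */
    static String csvCell(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    /** Turns the collected posts into CSV lines, header first. */
    static List<String> toCsvLines(List<Post> posts) {
        List<String> lines = new ArrayList<>();
        lines.add("title,image,link");
        for (Post p : posts) {
            lines.add(csvCell(p.title) + "," + csvCell(p.imageUrl) + "," + csvCell(p.link));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Post> posts = new ArrayList<>();
        posts.add(new Post("First post", "https://example.com/a.jpg", "https://example.com/post-a/"));
        posts.add(new Post("Second post", "https://example.com/b.jpg", "https://example.com/post-b/"));
        toCsvLines(posts).forEach(System.out::println);
    }
}
```

Keeping the parsed data in plain objects means the export format can be swapped later (CSV here, xlsx through Apache POI in a real project) without touching the parsing loop.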
Here is a good snippet for exporting a List to Excel:
https://www.jeejava.com/generic-way-of-reading-exc...
If you need to import the data into a WP site, use the WP All Import plugin. The xlsx files you created will work just fine.
