How do I follow a site's internal links and parse data from the pages they lead to?

Java

t55fr656tg7773, 2020-10-09 01:20:54

I need to collect the product cards from a site (price, photos, description, etc.). To do that I have to connect to the site, which I have already done. The question now is: how do I go through all the links on the site and take only the product information? I have looked at how recursion works, but I can't figure out how to pick out only the product cards. Here is my code:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;
    
    public class readAllLinks {
    
        public static Set<String> uniqueURL = new HashSet<String>();
        public static String my_site;
    
        public static void main(String[] args) {
    
            readAllLinks obj = new readAllLinks();
            my_site = "al-style.kz";
            obj.get_links("https://al-style.kz/");
        }
    
        private void get_links(String url) {
            try {
                Document doc = Jsoup.connect(url).userAgent("Mozilla").get();
                Elements links = doc.select("a");

                if (links.isEmpty()) {
                   return;
                }

                links.stream().map((link) -> link.attr("abs:href")).forEachOrdered((this_url) -> {
                    boolean add = uniqueURL.add(this_url);
                    if (add && this_url.contains(my_site)) {
                        System.out.println(this_url);
                        get_links(this_url);
                    }
                });
    
            } catch (IOException ex) {
    
            }
    
        }
    }

I don't fully understand the logic of this code, because sometimes it outputs nonsense.
I have another version without recursion. It behaves more predictably, but it only returns the first level of section links.
Could someone help, or explain how the recursion in this code works?

1 answer(s)

Orkhan, 2020-10-09
@t55fr656tg7773

public class readAllLinks {
Class names should start with a capital letter (ReadAllLinks). Read up on the Java naming conventions.
As for the code itself, there are quite a few things to note.

I need to collect the product cards from a site (price, photos, description, etc.). To do that I have to connect to the site, which I have already done.

All you did was open the site's main page and select ALL of its links (every a tag):
Elements links = doc.select("a");

The question now is: how do I go through all the links on the site and take only the product information?

I would do it differently: instead of collecting every link, collect only the links to the sections (categories).

The selector for them is:
#categories .sub-menu-item .sub-menu-link
Collect the matched links into a List, then iterate over that list and open each link the same way you already do:

doc = Jsoup.connect(url).userAgent("Mozilla").get();

Here, instead of url, pass a link from the list parsed from the menu.
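A minimal sketch of that first step, assuming jsoup is on the classpath and that the selector quoted above still matches the site's menu markup. The class name CategoryLinks and its two methods are made up for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper: collects category links from the menu of an already loaded page.
class CategoryLinks {

    // Selector quoted in the answer; adjust it if the site's markup differs.
    static final String CATEGORY_SELECTOR = "#categories .sub-menu-item .sub-menu-link";

    /** Returns the absolute URLs of all category links in the document, without duplicates. */
    static List<String> extract(Document doc) {
        Set<String> urls = new LinkedHashSet<>(); // keeps menu order, drops repeats
        for (Element link : doc.select(CATEGORY_SELECTOR)) {
            String url = link.absUrl("href"); // resolved against the document's base URI
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return new ArrayList<>(urls);
    }

    /** Downloads the start page and extracts the category links from it. */
    static List<String> fetch(String startUrl) throws IOException {
        Document doc = Jsoup.connect(startUrl).userAgent("Mozilla").get();
        return extract(doc);
    }
}
```

Calling CategoryLinks.fetch("https://al-style.kz/") would then give the list to iterate over. Unlike the recursive version, nothing here follows links blindly, so only the menu entries end up in the list.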
The product catalog pages are paginated.
For example, https://al-style.kz/catalog/mobilnye_telefony/

Look at how the pagination works:

https://al-style.kz/catalog/mobilnye_telefony/
https://al-style.kz/catalog/mobilnye_telefony/?PAGEN_1=2
?PAGEN_1={pageNum}

In other words, a query parameter is appended to the URL and incremented for each page. So after opening a category page, add this parameter and keep increasing its value until the pages run out. How you detect that a page does not exist depends on the site.
For example, you can check whether a particular block is present on the page.
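The pagination loop could be sketched roughly like this. The class name CategoryPager, the fetchLinks callback and the maxPages cap are all assumptions made for the example. The stop condition used here (a page that adds no new links) is just one possible check, chosen because it also works if the site serves the last page again for out-of-range page numbers.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper: walks the pages of one category.
class CategoryPager {

    /** Builds the URL of the given page; page 1 is the plain category URL. */
    static String pageUrl(String categoryUrl, int pageNum) {
        if (pageNum <= 1) {
            return categoryUrl;
        }
        String separator = categoryUrl.contains("?") ? "&" : "?";
        return categoryUrl + separator + "PAGEN_1=" + pageNum;
    }

    /**
     * Collects links from page 1, 2, 3, ... of a category.
     * fetchLinks is a caller-supplied function that loads a page URL and
     * returns the links found on it; maxPages is a safety cap.
     */
    static List<String> collect(String categoryUrl,
                                Function<String, List<String>> fetchLinks,
                                int maxPages) {
        Set<String> seen = new LinkedHashSet<>();
        for (int page = 1; page <= maxPages; page++) {
            List<String> links = fetchLinks.apply(pageUrl(categoryUrl, page));
            // addAll returns false when nothing new was added: the page was
            // empty or repeated links that were already collected.
            if (!seen.addAll(links)) {
                break;
            }
        }
        return new ArrayList<>(seen);
    }
}
```

With jsoup, fetchLinks would typically load the page with Jsoup.connect(url).get() and return the href values of the elements matched by the link selector.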
Next, on each page find the blocks (product cards).
The selector for them is:
.elements .element

Then find the selector for the link inside each card and save those links in a separate List:
.elements .element .link
Once you have gone through a category page by page and collected the links of all product cards, iterate over that list and open each of those links,
i.e. the page of the product itself, for example https://al-style.kz/catalog/mobilnye_telefony/mobi...
After that, all that remains is to extract the data with selectors, store it in a POJO (for example, Product) and export it somewhere.
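Such a POJO might look like the sketch below. The set of fields is an assumption based on what the question lists (price, photos, description), and the selectors needed to fill them on the product page are not shown because they depend on the site's markup.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain data holder for one product card; the fields are illustrative.
final class Product {
    private final String url;
    private final String name;
    private final String price;       // kept as text, exactly as scraped
    private final String description;
    private final List<String> photoUrls;

    Product(String url, String name, String price,
            String description, List<String> photoUrls) {
        this.url = url;
        this.name = name;
        this.price = price;
        this.description = description;
        // Defensive copy so later changes to the caller's list do not leak in.
        this.photoUrls = Collections.unmodifiableList(new ArrayList<>(photoUrls));
    }

    String getUrl() { return url; }
    String getName() { return name; }
    String getPrice() { return price; }
    String getDescription() { return description; }
    List<String> getPhotoUrls() { return photoUrls; }

    @Override
    public String toString() {
        return "Product{name='" + name + "', price='" + price + "', url='" + url + "'}";
    }
}
```

Keeping the price as a String avoids guessing at the site's number format; it can be parsed later once the format is known.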
Apache POI can be used to export to xlsx.
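A rough sketch of such an export, assuming the poi-ooxml dependency is available. The class name XlsxExport, the sheet name and the column headers are made up for the example, and each product is passed as a plain String array to keep the sketch self-contained.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

// Hypothetical helper: writes scraped rows into an .xlsx workbook.
class XlsxExport {

    // Illustrative column headers; match them to the data you actually collect.
    static final String[] HEADER = {"Name", "Price", "Description", "URL"};

    /** Writes the header and one row per product to the given stream. */
    static void write(List<String[]> rows, OutputStream out) throws IOException {
        try (Workbook workbook = new XSSFWorkbook()) {
            Sheet sheet = workbook.createSheet("Products");
            writeRow(sheet.createRow(0), HEADER);
            for (int i = 0; i < rows.size(); i++) {
                writeRow(sheet.createRow(i + 1), rows.get(i)); // row 0 is the header
            }
            workbook.write(out);
        }
    }

    private static void writeRow(Row row, String[] values) {
        for (int col = 0; col < values.length; col++) {
            row.createCell(col).setCellValue(values[col]);
        }
    }
}
```

To produce a file, pass a FileOutputStream (for example for products.xlsx) opened in a try-with-resources block as the out argument.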