PHP
alexwprof, 2019-04-18 16:01:16

phpQuery + cURL + pagination: how do I parse paginated pages?

I need to parse search results from Google. The code is written and almost works.
The code below follows pagination correctly on other sites, but not on Google. It selects page elements by selector and processes them; for now I just print them to the screen to check the result.

public function pagination($url, $start, $end){
        // Fetch data from a page with pagination
        if ($start < $end) {
            $file = file_get_contents($url);
            $doc = phpQuery::newDocument($file);
            foreach ($doc->find('#res') as $art) {
                $art = pq($art);
                $this->range = $art->find('#search');

                echo '<hr>';
            }

            $next = 'https://www.google.com' . $doc->find('#nav a')->next()->attr('href');
            var_dump($next);
            if (!empty($next)) {
                $start++;
                $this->pagination($next, $start, $end);

            }

        }

    }

The line $next = 'https://www.google.com' . $doc->find('#nav a')->next()->attr('href'); is responsible for finding the next-page link in the pagination block. But var_dump($next) shows that the href attribute is not retrieved.
If I remove next() from that line, the variable does receive the pagination links, and the code gets the elements from the second page.
Does anyone know an alternative to phpQuery's next() method? To repeat: the code works on other sites, and it is only on Google that next() breaks everything. In the end I need to walk through the pagination and parse each page.


2 answer(s)
Rodion Gashé, 2019-04-18
@alexwprof

Load the HTML into a DOMDocument:
DOMDocument::loadHTML - Load HTML from string
Then use XPath to find the elements you need:
DOMXPath::query - Executes given XPath expression
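
A minimal sketch of that approach, using only PHP's built-in DOM extension. The sample markup, the helper name findNextHref and the id pnnext are assumptions made for illustration; the real page has to be inspected to find the right XPath expression.

```php
<?php
// Hypothetical sample markup; a real results page will differ.
$html = <<<'HTML'
<html><body>
<div id="res"><div id="search">results</div></div>
<table id="nav"><tr>
  <td><a href="/search?q=php&amp;start=0">1</a></td>
  <td><a href="/search?q=php&amp;start=10">2</a></td>
  <td><a id="pnnext" href="/search?q=php&amp;start=10">Next</a></td>
</tr></table>
</body></html>
HTML;

// Hypothetical helper: returns the href of the first node matched by
// $xpathExpr, or null when nothing matches or the href is empty.
function findNextHref(string $html, string $xpathExpr): ?string
{
    if ($html === '') {
        return null;
    }

    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($xpathExpr);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    $node = $nodes->item(0);
    $href = $node instanceof DOMElement ? $node->getAttribute('href') : '';

    return $href === '' ? null : $href;
}

// Assumed expression: the "Next" anchor inside the #nav block.
$next = findNextHref($html, '//*[@id="nav"]//a[@id="pnnext"]');
var_dump($next);
```

Because the helper returns null when no link is found, the caller can test the result before prepending the site's base URL.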

Maxim Tkach, 2019-04-26
@mtseo

Something like this:

    /* Find the link to the next page */
    $next_page = pq($doc)->find('li.pagination__item--next > a')->attr('href');
    $next_url = $base_url . $next_page;

    /* Check whether there is a next page */
    if (!empty($next_page)){
        sleep(5);
        urls_parser($next_url);
    }

Check whether there is a link to the next page and, if there is, build the URL from it and run the parser again.
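
The same idea can be written as a loop with a page limit instead of recursion. The sketch below is only an illustration: it swaps phpQuery for the built-in DOMXPath so that it runs without extra libraries, and nextPageHref, crawl, the canned pages and example.com are all hypothetical. Note that it tests the raw href before prepending the base URL, because a string that already starts with the base URL is never empty.

```php
<?php
// Hypothetical helper: href of the "next page" link, or null if absent.
function nextPageHref(string $html): ?string
{
    if ($html === '') {
        return null;
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // XPath equivalent of the CSS selector li.pagination__item--next > a
    $nodes = $xpath->query(
        '//li[contains(concat(" ", normalize-space(@class), " "), " pagination__item--next ")]/a'
    );
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    $node = $nodes->item(0);
    $href = $node instanceof DOMElement ? $node->getAttribute('href') : '';

    return $href === '' ? null : $href;
}

// Hypothetical crawler: $fetch is any callable that returns the HTML for a URL.
function crawl(callable $fetch, string $baseUrl, string $startPath, int $maxPages, int $delaySeconds = 5): array
{
    $visited = [];
    $path = $startPath;

    while ($path !== null && count($visited) < $maxPages) {
        $url = $baseUrl . $path;
        $visited[] = $url;

        $html = $fetch($url);
        // Parse the results from $html here, then look for the next page.
        $path = nextPageHref($html);

        // Pause between requests, but only if another request will follow.
        if ($path !== null && count($visited) < $maxPages && $delaySeconds > 0) {
            sleep($delaySeconds);
        }
    }

    return $visited;
}

// Canned pages stand in for real HTTP responses (illustration only).
$pages = [
    'https://example.com/list?page=1' =>
        '<ul><li class="pagination__item pagination__item--next"><a href="/list?page=2">Next</a></li></ul>',
    'https://example.com/list?page=2' =>
        '<ul><li class="pagination__item">last page</li></ul>',
];
$fetch = function (string $url) use ($pages): string {
    return $pages[$url] ?? '';
};

$visited = crawl($fetch, 'https://example.com', '/list?page=1', 10, 0);
print_r($visited);
```

In real use, $fetch would wrap file_get_contents or cURL, and the delay would stay above zero to avoid hammering the site.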
