How to link PHP Simple HTML DOM Parser with cURL?

PHP

midarovrk, 2020-05-18 13:11:59

Please help me connect PHP Simple HTML DOM Parser with cURL.

I wrote a simple image parser in PHP based on Simple HTML DOM Parser.
The parser downloads images to my own server by URL. There is one catch, though. The documentation says:

Unfortunately, file_get_html loads pages with a plain file_get_contents. This means that if the host has set allow_url_fopen = false in php.ini (that is, it has forbidden opening remote files), loading anything remotely will not work. Serious websites should not be parsed this way anyway; it is better to use cURL with proxy and SSL support. For our experiments, however, file_get_html is quite enough.

They advise using it in conjunction with cURL.

Here is my parsing code.

<?php
require_once 'simple_html_dom.php';

// search URL
$url = 'https://сайт.org/ссылка';
$n = 200;
// load this URL
$data = file_get_html($url);
// strip unneeded data from the page
foreach ($data->find('script,link,comment') as $tmp) $tmp->outertext = '';
// find all images on the page
if (count($data->find('div#all img'))) {
  $i = 1;
  foreach ($data->find('div#all img') as $img) {

    // ...and here goes the actual parsing code.

    if ($i > $n) break; // leave the loop once enough photos have been downloaded
  }
}
$data->clear(); // clean up after ourselves
unset($data);
?>

How do I hook cURL into this code, so that $url and $data can then be used with PHP Simple HTML DOM Parser?

$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://сайт.org/ссылка');
curl_setopt($curl, CURLOPT_HEADER, 0);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17');
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($curl, CURLOPT_REFERER, 'https://сайт.org');
$url = curl_exec($curl); // the page HTML on success, false on failure
curl_close($curl);

1 answer(s)

Eugene, 2020-05-18
@midarovrk

Split the code into three logical parts and implement them independently:
1. Fetch the page HTML (cURL, Guzzle)
2. Parse it and extract the image URLs (DOM parser, DiDOM, symfony/dom-crawler)
3. Download the images (cURL, Guzzle, wget)
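
The three steps above could be sketched roughly as follows. This is only an illustration, not the library's prescribed approach: the function names fetch_page, extract_image_urls, image_filename and download_images are made up for this example, the user agent string is a placeholder, and the 'div#all img' selector is taken from the question. The key point is that Simple HTML DOM also provides str_get_html(), which parses an HTML string, so the page can be fetched with cURL and then handed to the parser.

```php
<?php
// Step 1: fetch the page HTML with cURL (replaces file_get_html($url)).
function fetch_page(string $url): string
{
    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_CONNECTTIMEOUT => 30,   // seconds
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleParser)', // placeholder
    ]);
    $body = curl_exec($curl);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($curl));
    }
    return $body;
}

// Step 2: parse the HTML string and collect up to $limit image URLs.
function extract_image_urls(string $html, int $limit): array
{
    // Loaded here so that steps 1 and 3 do not depend on the library.
    require_once 'simple_html_dom.php';

    $data = str_get_html($html); // string-based counterpart of file_get_html()
    if (!$data) {
        return []; // nothing parseable (e.g. an empty response)
    }
    $urls = [];
    foreach ($data->find('div#all img') as $img) {
        if (count($urls) >= $limit) {
            break;
        }
        if ($img->src) {
            $urls[] = $img->src; // relative URLs are not resolved in this sketch
        }
    }
    $data->clear(); // free the parser's memory
    return $urls;
}

// Helper for step 3: derive a local file name from an image URL.
function image_filename(string $url, int $index): string
{
    $name = basename((string) parse_url($url, PHP_URL_PATH));
    return $name !== '' ? $name : "image_$index";
}

// Step 3: download each image into $dir and return how many were saved.
function download_images(array $urls, string $dir): int
{
    $saved = 0;
    foreach ($urls as $index => $url) {
        $target = $dir . '/' . image_filename($url, $index);
        if (file_put_contents($target, fetch_page($url)) !== false) {
            $saved++;
        }
    }
    return $saved;
}
```

With these in place, the code from the question would reduce to something like $urls = extract_image_urls(fetch_page($url), $n); followed by download_images($urls, __DIR__ . '/images'); assuming the images directory already exists and is writable.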