Parsing
lokomotor72, 2021-03-06 13:15:06

How to parse photos using preg_match_all?

Hello everyone, I have code that parses images from a site, given a link:

<?php
// page with the images
$text = file_get_contents("link to the site to parse");
 
// pick out the image paths and put them into an array
preg_match_all("'<img\s+src=\"(\S*.(png|jpg))\"'si", $text, $result);      
 
echo"Images found on the page = ".count($result[1]);
//print_r($result[1]); // the images that were found
echo("<br>");
echo("<br>");
echo("Image URLs...");
// create the folder if it does not exist
if (!file_exists("images")) 
{
   mkdir("images", 0700); // create the folder
}
 
$move_dir = "images/"; // path of the created folder
for($i = 0; $i <=(count($result[1])-1); $i++) 
{
// build the URL of the image  
$url = "http://www.site.com/".$result[1][$i]; 
echo("<br>");
echo($url);
echo("<br>");
$filename = basename($url); // image file name  
file_put_contents($move_dir.'/'.$filename, file_get_contents($url));    
}
echo("<br>");
echo("Copying finished!");
 
?>


The images are taken from a block with an img tag:

<figure itemprop="associatedMedia" itemscope="" itemtype="http://schema.org/ImageObject">
      <a data-w="174" data-h="250" class="item item-gallery" href="http://www.site.com/get_image/2/f4f6ea5666f7319419d4436374de951b/main/1920x1920/10000/10423/1152485.jpg/" itemprop="contentUrl" data-size="1280x1920" style="width: 154px; height: 220px; display: block;">
          <img src="http://www.site.com/contents/albums/main/370x250/10000/10423/1152485.jpg" itemprop="thumbnail" alt="">
      </a>
                                                                                   
    </figure>


But I need to take the link from the <a> tag that surrounds the img tag, the one that has class="item item-gallery".

Maybe someone can tell me what changes need to be made to the code? Thanks in advance.


3 answer(s)
Oleg, 2021-03-06
@lolzqq

The change goes here:

preg_match_all("'<img\s+src=\"(\S*.(png|jpg))\"'si", $text, $result);

More specifically, rewrite the pattern
"'<img\s+src=\"(\S*.(png|jpg))\"'si"
so that it catches the href attribute of the <a> tag,
i.e. so that it finds this string:
href="http://www.site.com/get_image/2/f4f6ea5666f7319419d4436374de951b/main/1920x1920/10000/10423/1152485.jpg/

For experimenting, https://regex101.com/ is a handy helper.
I ended up with a pattern like this:
href=".*([a-z\_\-\.])*\.((jpg)|(png))\/"
Next, use str_replace to replace the occurrences of href=" and " with an empty string "", and you get a clean URL as output.
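
A minimal sketch of the same idea, not the asker's final code: the pattern below captures the href value in a group, so the str_replace cleanup is not needed. The function name extractGalleryLinks is hypothetical, and the pattern assumes that class="item item-gallery" comes before href inside the <a> tag, as in the markup from the question.

```php
<?php
// Hypothetical helper: returns the href values of <a class="item item-gallery"> tags.
// Assumption: the class attribute appears before href, as in the sample markup.
function extractGalleryLinks(string $html): array
{
    $pattern = '~<a\b[^>]*\bclass="item item-gallery"[^>]*\bhref="([^"]+\.(?:jpg|png)/?)"~i';
    preg_match_all($pattern, $html, $matches);
    return $matches[1]; // group 1 holds the bare URL, without href=" and the closing quote
}

// Sample markup taken from the question (shortened to the relevant tags).
$html = '<a data-w="174" data-h="250" class="item item-gallery" '
      . 'href="http://www.site.com/get_image/2/f4f6ea5666f7319419d4436374de951b/main/1920x1920/10000/10423/1152485.jpg/" '
      . 'itemprop="contentUrl">'
      . '<img src="http://www.site.com/contents/albums/main/370x250/10000/10423/1152485.jpg" itemprop="thumbnail" alt="">'
      . '</a>';

foreach (extractGalleryLinks($html) as $link) {
    // basename() drops the trailing slash, so it still yields the file name
    echo $link, ' => ', basename($link), PHP_EOL;
}
```

In the original script, $result[1] would then already hold absolute URLs, so the "http://www.site.com/" prefix should not be prepended to them again.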

Rasul Gitinov, 2017-12-17
@fantazerno

In the admin panel, go to "Settings" -> "Permalinks", select a different permalink structure, and save.

Mr Crabbz, 2017-12-17
@Punkie

Update Duplicator. One of the recent versions had a bug that replaced the % symbol in the database with this kind of garbage. After I updated the plugin, the problem went away for me.
Fix for an already installed site affected by the bug:
1. Export the database to a .sql file and open it in an editor.
2. Search and replace
{03873571dc18fad47add251c551321dbad75fc58166b9b4f6f1c1bdbb6ac251}
with
%
3. Save the file and import the database back.
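
The search/replace step can also be scripted instead of done in an editor. This is only a sketch: the function name replacePlaceholderInDump and the file names dump.sql and dump.fixed.sql are assumptions, and the placeholder string is copied as given in this answer. Note that a plain text replacement changes string lengths, so values stored in serialized form may need separate handling.

```php
<?php
// Hypothetical helper: replaces every occurrence of $placeholder with "%"
// in a SQL dump and writes the result to a new file.
// Returns the number of replacements made.
function replacePlaceholderInDump(string $inPath, string $outPath, string $placeholder): int
{
    $sql = file_get_contents($inPath);
    if ($sql === false) {
        throw new RuntimeException("Cannot read $inPath");
    }
    $fixed = str_replace($placeholder, '%', $sql, $count);
    if (file_put_contents($outPath, $fixed) === false) {
        throw new RuntimeException("Cannot write $outPath");
    }
    return $count;
}

// Placeholder copied from the answer; the file names are assumptions.
$placeholder = '{03873571dc18fad47add251c551321dbad75fc58166b9b4f6f1c1bdbb6ac251}';
if (is_file('dump.sql')) {
    $n = replacePlaceholderInDump('dump.sql', 'dump.fixed.sql', $placeholder);
    echo $n, " replacement(s) made", PHP_EOL;
}
```

Writing to a separate output file keeps the original dump untouched in case the result needs to be checked before importing it.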
