Why does my parser consume all the resources and bring down the server?

Elizabeth Lawrence, 2019-04-26 10:13:53

PHP

Hello. Please help. I run a scraper from cron that fetches information from my supplier's website and processes it. I have about 8,000 product pages, but the parser runs on only 500 of them at a time, and even with 500 it eats all the resources and leaves the server unusable: I cannot connect over SSH and every site returns HTTP 500 errors. Most likely my parser is simply not optimized, but I don't yet understand this well enough myself, so I'm asking for advice: what should be rewritten here? The parser first logs in to the supplier's website, because that is the only way to see stock levels, and then collects information as the authorized user. As it goes, it builds the pieces of the database update queries and pushes them into arrays; only at the end does it open a single database connection and run the update queries. The server has 1 GB of RAM.

ini_set('max_execution_time', '10000');
set_time_limit(0);
ini_set('memory_limit', '768M');
ignore_user_abort(true);

require_once 'vendor/autoload.php';
require_once 'phpquery/phpQuery/phpQuery.php';

//URL used to log in
$url_auth = 'http://...';

//My predefined array: the key is the supplier's SKU, the value is its product_id in my store
$massiv = [
"supplier SKU" => "my product_id",
...
];

//Declare the arrays that may be filled in later.
$existart = []; $existartstatus = []; $existartstatus2 = []; $notupdated = [];

//Files that I will write the values I need to as the script runs
$file_result = 'not_added.txt'; if (file_exists($file_result)) unlink($file_result);
$file_result2 = 'empty.txt'; if (file_exists($file_result2)) unlink($file_result2);
$file_result3 = 'not_updated.txt'; if (file_exists($file_result3)) unlink($file_result3);

//This function is called for every single product page so that the page content is fetched as the authorized user
function get_content($url) {
  $ch = curl_init($url);
  curl_setopt ($ch, CURLOPT_HEADER, 0);
  curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, true);
  curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt ($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");
  curl_setopt ($ch, CURLOPT_SSL_VERIFYPEER, false);
  curl_setopt ($ch, CURLOPT_SSL_VERIFYHOST, false);
  curl_setopt ($ch, CURLOPT_POST, true);
  curl_setopt ($ch, CURLOPT_POSTFIELDS, array(
    'login' => '###',
    'pass' => '###',
  ));
  curl_setopt ($ch, CURLOPT_COOKIEJAR, __DIR__ . '/cook.txt');
  curl_setopt ($ch, CURLOPT_COOKIEFILE, __DIR__ . '/cook.txt');
  curl_setopt ($ch, CURLOPT_TIMEOUT, 3000);
  curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, 300);
  $res = curl_exec($ch);
  curl_close($ch);
  return $res;
}

$merged = [ /* array of links to the product pages that need to be walked through */ ];

//Main function, runs for every link
function foreach_parser() {
  global $merged; global $massiv; global $existart; global $existartstatus; global $existartstatus2; global $notupdated; global $file_result; global $file_result2; global $file_result3;
  foreach ($merged as $page){
    $file = get_content($page);
    $doc = phpQuery::newDocument($file);
    $doc = pq($doc);

      $art = $doc->find('#r div.x div.xx div.xxx')->text();
      $art = str_replace("/"," ",$art);
      $art = trim($art);
          
      /* Several more operations run here to find values and process them. They define $stock, $status, and other variables. */
      
      //Check whether my predefined array has an element whose key equals this SKU; if so, take its value
      if (isset($massiv[$art])) {
        if ($status == "Preorder") {
          $value = $massiv[$art];
          $existart[] = "WHEN product_id = ".$value." THEN ".$stock;
          $existartstatus[] = "WHEN product_id = ".$value." THEN 'Под заказ'"; // 'Под заказ' = "Made to order"
          $existartstatus2[] = "WHEN product_id = ".$value." THEN 24";					
        } else {
          $value = $massiv[$art];
          $existart[] = "WHEN product_id = ".$value." THEN ".$stock;				
        }
        $value2 = $massiv[$art];
        $notupdated[] = $value2;
      } else {
        //No array element with this key was found, so write this SKU to a file
        $message = $art.PHP_EOL;
        file_put_contents($file_result, $message, FILE_APPEND);
      }
      
      echo $art." processed! ";
    $i++;
  }
  
  //If my arrays were filled with data, join their elements into a single string
  if($existart) {$existart_oneline = implode(" ", $existart);}
  if($existartstatus) {$existartstatus_oneline = implode(" ", $existartstatus);}
  if($existartstatus2) {$existartstatus2_oneline = implode(" ", $existartstatus2);}
  $massiv_onlyid = implode(",", $massiv);
  
  //Compare my original array with the array produced by parsing to find the products that exist in my array (on the site) but were not touched during parsing; this tells me which stock levels were not updated.
  $mas_notupdated = array_diff($massiv, $notupdated);
  if ($mas_notupdated) { $mas_notupdated_txt = implode('`', $mas_notupdated); file_put_contents($file_result3, $mas_notupdated_txt); }
  
  //Connect to the database and run the queries that update stock levels and, where needed, other fields
  $linkmysql = mysqli_connect('localhost', 'xxx', 'xxx', 'xxx');	
  
  if (!$linkmysql) {
    $sqlconnecterror = "Error: unable to establish a connection to MySQL.";
    exit;
  }
  if ($linkmysql) {		
    if($existart) {
      // First query that updates the information
    }
    if($existartstatus) {
      // Second query that updates the information
    }

    mysqli_close($linkmysql);
  }
    
  phpQuery::unloadDocuments();	

}

$data = get_content($url_auth);
foreach_parser();

The parser drives RAM usage up to 99.9%, and after that nothing else works. I tried setting memory_limit to '512M', but it still takes all the RAM. How can I keep it from consuming all the resources?

2 answers

Ilya Bobkov, 2019-04-26
@heksen

You are making the curl request inside a loop. Each request may or may not get a response from the server, and memory stays allocated while it waits. Look into asynchronous curl requests. I think that is where the problem is.
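
For illustration, a minimal sketch of what asynchronous requests could look like with PHP's curl_multi functions. The helper name fetch_batch, the batch size of 5 and the timeout values are assumptions, not something taken from the question; the cookie file is the cook.txt the question already writes at login. Concurrent requests mainly shorten the total run time; each response is still held in memory until it is processed, so keeping the batches small is what bounds the memory here.

```php
<?php
// Hypothetical helper: fetch a small batch of URLs concurrently.
// Returns [url => body]; a body may be null or empty if the transfer failed.
function fetch_batch(array $urls, string $cookieFile = __DIR__ . '/cook.txt'): array
{
    $multi = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_COOKIEFILE     => $cookieFile, // reuse the session saved at login
            CURLOPT_CONNECTTIMEOUT => 10,          // assumed values, far below 300/3000 s
            CURLOPT_TIMEOUT        => 30,
        ]);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none is still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity instead of spinning
        }
    } while ($running > 0 && $status === CURLM_OK);

    $pages = [];
    foreach ($handles as $url => $ch) {
        $pages[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);

    return $pages;
}

// Possible usage with the $merged array from the question:
// foreach (array_chunk($merged, 5) as $batch) {
//     foreach (fetch_batch($batch) as $url => $html) {
//         // parse $html here, then let it go out of scope
//     }
// }
```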

synapse_people, 2019-04-26
@synapse_people

$doc = phpQuery::newDocument($file);
It's better to replace this shit with the native DOMDocument; most likely the memory is leaking somewhere in there.
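
A rough sketch of that replacement, assuming the SKU lookup from the question is the only thing phpQuery is needed for. The function names extract_article and class_test are made up for this example, and the XPath is a hand translation of the placeholder selector '#r div.x div.xx div.xxx', so it has to be adjusted to the real markup. Unlike phpQuery's text(), which joins the text of all matches, this takes only the first matching node.

```php
<?php
// Hypothetical helper: XPath test equivalent to the CSS class selector ".name".
function class_test(string $name): string
{
    return 'contains(concat(" ", normalize-space(@class), " "), " ' . $name . ' ")';
}

// Hypothetical replacement for the phpQuery lookup of the SKU.
// Returns null when the page is empty or the element is not found.
function extract_article(string $html): ?string
{
    if (trim($html) === '') {
        return null; // nothing to parse, e.g. the request failed
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // supplier HTML is rarely valid
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();                        // do not let parse errors pile up
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    $query = '//*[@id="r"]//div[' . class_test('x') . ']'
           . '//div[' . class_test('xx') . ']'
           . '//div[' . class_test('xxx') . ']';
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Same clean-up as in the question: slashes become spaces, then trim.
    return trim(str_replace('/', ' ', $nodes->item(0)->textContent));
}
```

Inside the loop this would stand in for the three phpQuery lines, e.g. $art = extract_article($file);. The DOMDocument is local to the function, so PHP can free it as soon as the function returns rather than keeping every parsed page until phpQuery::unloadDocuments() is called at the very end.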