HTML
vox_termin, 2021-08-31 19:39:10

How to set up parsing from HTML to CSV (or SQL) without a 5xx server error?

While the parser's PHP script is running, the server returns a 5xx error at around the 300th record. After that, the script still manages to add another 500-600 records in the background (out of 30,000).
How can I configure the parser so that it writes all 30,000 records without server errors?

include "simple_html_dom.php";
header('Content-type: text/plain');

$filename = 'name.csv'; // CSV file to write to
$file = "urls.txt"; // file with links to all 30,000 articles

$fields = file($file, FILE_SKIP_EMPTY_LINES | FILE_IGNORE_NEW_LINES );
$fp = fopen($filename, 'a');
$i=1;

foreach($fields as $field) :

        $url1 = $field;
// link to the article used to build the DOM; the meta tags and the heading are needed

        $url2 = $field . '/content.html';
// link to the file with each article's content; it holds only the content, without the heading or meta tags
    
        $content1 = @file_get_contents($url2);
        $content1 = str_replace(array("\r\n", "\n", "<br />", "<br/>"), "", $content1);
        $_content1 = addslashes($content1);
    
        $html = file_get_html($url1);
        if (!$html) { continue; } // skip articles whose page could not be loaded
        
        $title = $html->find('h1',0)->plaintext;
        $_title = addslashes($title);
        $metakey = $html->find( "meta[name=keywords]" );
        $metadesc = $html->find( "meta[name=description]" );
        $html->clear();
        unset($html);
        $metakey1 =  $metakey[0]->content;
        $metadesc1 = $metadesc[0]->content;

        fputcsv($fp, array($i, $_title, $metakey1, $metadesc1, $_content1 ));
        
        $i++;
  endforeach; 
        fclose($fp);


2 answer(s)
ScriptKiddo, 2021-08-31
@ScriptKiddo

How can I configure the parser to write all 30000 records without server errors?

You need to pause between requests so that you don't get banned by the rate limiter.
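A minimal sketch of such a pause, assuming half a second between requests is enough (the real limit of the target site is not known). `fetchThrottled`, `$delayMicroseconds` and `$fetch` are hypothetical names, not part of the original script:

```php
<?php
// Hypothetical helper: wait, then fetch a URL.
// $fetch lets you swap in another loader; by default file_get_contents is used.
function fetchThrottled(string $url, int $delayMicroseconds = 500000, ?callable $fetch = null)
{
    usleep($delayMicroseconds);             // pause before hitting the remote server
    $fetch = $fetch ?? 'file_get_contents';
    return @$fetch($url);                   // file_get_contents returns false on failure
}
```

In the loop from the question this could stand in for the `@file_get_contents($url2)` call; a similar `usleep()` before `file_get_html($url1)` would slow down the second request as well.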

Mike, 2021-09-05
@mSnus

To begin with, check what exactly is causing the 500 error: enable the PHP error log and look at it, the details will be there.
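One way to turn the log on from inside the script, as a sketch. It assumes the host lets the script change ini settings at runtime and write to its own directory; `parser_errors.log` is a made-up file name:

```php
<?php
// Report every error, but send it to a log file instead of the response body.
error_reporting(E_ALL);
ini_set('display_errors', '0');
ini_set('log_errors', '1');
ini_set('error_log', __DIR__ . '/parser_errors.log'); // hypothetical log path
```

Put this at the very top of the parser, before the `include`, so that errors raised while loading the library are logged too.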
Most likely the max execution time has been exceeded. If the server is yours, you can increase it; if not, you need to split the source file into pieces (about 300 records each, since that is how many you say go through) and run the script on each piece in turn.
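Both options could look roughly like this. It is only a sketch: `nextBatch`, `$offset` and `$batchSize` are hypothetical names, and `set_time_limit()` may be disabled on shared hosting:

```php
<?php
// Hypothetical helper: return the slice of URLs to process in one run.
// 300 mirrors the number of records that got through before the error.
function nextBatch(array $urls, int $offset, int $batchSize = 300): array
{
    return array_slice($urls, $offset, $batchSize);
}

// If the server is yours: lift the execution time limit (0 means no limit).
set_time_limit(0);
```

With the script from the question, `$fields` would be passed through `nextBatch()` before the `foreach`, and the offset would grow by 300 on every run, for example taken from a query parameter or a small state file.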
It's also possible that parsing the HTML with regexps will be faster than building the DOM.
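A rough sketch of the regexp variant for the three fields the script reads. `extractFields` is a hypothetical name, and the patterns assume that each meta tag is written as `name="..."` followed directly by `content="..."`; pages with a different attribute order will not match. Whether it is actually faster has to be measured:

```php
<?php
// Hypothetical helper: pull the <h1> text and two meta tags out of raw HTML.
function extractFields(string $html): array
{
    $fields = ['title' => '', 'keywords' => '', 'description' => ''];

    // First <h1>, with any nested tags stripped.
    if (preg_match('~<h1[^>]*>(.*?)</h1>~is', $html, $m)) {
        $fields['title'] = trim(strip_tags($m[1]));
    }
    // \1 and \2 make the closing quote match the opening one.
    foreach (['keywords', 'description'] as $name) {
        $pattern = '~<meta\s+name=(["\'])' . $name . '\1\s+content=(["\'])(.*?)\2~is';
        if (preg_match($pattern, $html, $m)) {
            $fields[$name] = $m[3];
        }
    }
    return $fields;
}
```

The input would be the string returned by `file_get_contents($url1)`, so no DOM object is built at all.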
