PHP
Msim, 2015-06-21 11:23:52

How to parse more than 90 pages with Simple HTML DOM?

When the limit is set to 30 or 60, everything parses, but with more than 90 pages I get an error:
Fatal error: Call to a member function find() on boolean in E:\srv\OpenServer\domains\parser\index.php on line 51

<form method="POST">
    <input name="url" type="text" value="<?=isset($_REQUEST['url'])?$_REQUEST['url']:'http://citymarket.ua/';?>"/><input type="submit" value="Go">
</form>
<?php

include 'simple_html_dom.php';

function request($url,$post = 0){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url ); // target URL
    curl_setopt($ch, CURLOPT_HEADER, 0); // do not include response headers in the output
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the server response as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);// connection timeout, in seconds
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_COOKIEJAR, dirname(__FILE__).'/cookie.txt'); // save cookies to a file
    curl_setopt($ch, CURLOPT_COOKIEFILE,  dirname(__FILE__).'/cookie.txt');
    curl_setopt($ch, CURLOPT_POST, $post!==0 ); // send as POST when data is given
    if($post)
        curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

class parser{
    var $cacheurl = array();
    var $result = array();
    var $_allcount = 60;
    function __construct(){
        if(isset($_POST['url'])){
            $this->parse($_POST['url']);
        }
    }
    function parse($url){
        $url = $this->readUrl($url);

        if( !$url or $this->cacheurl[$url] or $this->cacheurl[preg_replace('#/$#','',$url)] )
            return false;

        $this->_allcount--;

        if( $this->_allcount<=0 )
            return false;

        $this->cacheurl[$url] = true;
        $item = array();

        $data = str_get_html(request($url));
        $item['url'] = $url;
        $item['title'] = count($data->find('title'))?$data->find('title')->plaintext:'';
        $item['text'] = count($data->find('img.item-image'))?$data->find('img.item-image')->src:'';
        $this->result[] = $item;

        if(count($data->find('a'))){
            foreach($data->find('a') as $a){
                $this->parse($a->href);
            }
        }
        $data->clear();
        unset($data);

    }
    function printresult(){
        foreach($this->result as $item){
            echo '<h2>'.$item['title'].' - <small>'.$item['url'].'</small></h2>';
            echo '<p style="margin:20px 0px;background:#eee; padding:20px;">'.'<img src="'.$item['text'].'"/>'.'</p>';
        };
        exit();
    }
    var $protocol = '';
    var $host = '';
    var $path = '';
    function readUrl($url){
        $urldata = parse_url($url);
        if( isset($urldata['host']) ){
            if($this->host and $this->host!=$urldata['host'])
                return false;

            $this->protocol = $urldata['scheme'];
            $this->host = $urldata['host'];
            $this->path = $urldata['path'];
            return $url;
        }

        if( preg_match('#^/#',$url) ){
            $this->path = $urldata['path'];
            return $this->protocol.'://'.$this->host.$url;
        }else{
            if(preg_match('#/$#',$this->path))
                return $this->protocol.'://'.$this->host.$this->path.$url;
            else{
                if( strrpos($this->path,'/')!==false ){
                    return $this->protocol.'://'.$this->host.substr($this->path,0,strrpos($this->path,'/')+1).$url;
                }else
                    return $this->protocol.'://'.$this->host.'/'.$url;
            }
        }
    }
}
$pr = new Parser();
$pr->printresult();

3 answer(s)
Alexander Taratin, 2015-06-21
@Msim

https://github.com/chuyskywalker/rolling-curl
+
https://github.com/olamedia/nokogiri
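
As far as I know, rolling-curl is a wrapper around PHP's built-in curl_multi functions, so the idea behind this suggestion can be sketched with curl_multi directly. This is only a rough sketch and not rolling-curl's own API: the function name fetchAll and its $timeout parameter are made up for illustration, and the nokogiri parsing step is left out.

```php
<?php
// Hypothetical helper: download several URLs concurrently with curl_multi.
// Returns an array of url => body, with false for an empty or failed transfer.
function fetchAll(array $urls, $timeout = 30)
{
    $multi = curl_multi_init();
    $handles = array();
    foreach (array_unique($urls) as $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none of them is still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity instead of spinning
        }
    } while ($running > 0 && $status === CURLM_OK);

    $bodies = array();
    foreach ($handles as $url => $ch) {
        $body = curl_multi_getcontent($ch);
        $bodies[$url] = ($body === null || $body === '') ? false : $body;
        curl_multi_remove_handle($multi, $ch);
        // On PHP versions before 8.0 you would also call curl_close($ch) here.
    }
    return $bodies;
}
```

Each body returned this way still has to be checked before it is handed to a parser, since a failed transfer comes back as false.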

Alexey Skleinov, 2017-05-12
@lexskal

I never stop being amazed by people.

$data->clear(); // cleaning up, well done
unset($data);

Where is the return?
Twelve characters in this example would make it clear who is going where, and why.
The real problem is careless hands or a tired brain. Read the manuals first instead of showing off your ignorance of the subject, let alone handing out advice.
Sorry for the flood, but this question is very common and has not really been resolved. I have come across this example of using the library many times, and the same error is present in the original source of the example.
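
One possible reading of the complaint above, taken together with the fatal error in the question: str_get_html() can return false instead of a DOM object, so parse() needs to return early before calling find(). Below is a minimal sketch of such a guard. The helper name loadDom is hypothetical and not part of Simple HTML DOM; the parser is passed in as a callable so the snippet runs on its own.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): parse $html with $parse
// and return the DOM object, or null when there is nothing usable to work with.
function loadDom($html, $parse)
{
    if (!is_string($html) || $html === '') {
        return null; // curl_exec() returned false or an empty body
    }
    $dom = call_user_func($parse, $html);
    return is_object($dom) ? $dom : null; // a false result means parsing failed
}

// Possible use inside parse() from the question (assumes simple_html_dom.php is loaded):
//
//     $data = loadDom(request($url), 'str_get_html');
//     if ($data === null)
//         return false;
//     $titles = $data->find('title');   // find() without an index returns an array
//     $item['title'] = count($titles) ? $titles[0]->plaintext : '';
```

With a guard like this, a page that fails to download or parse is skipped instead of stopping the whole crawl with "Call to a member function find() on boolean".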

CORRECTOR86, 2018-03-02
@CORRECTOR86

I had a similar problem, but with a single page. I resolved it by increasing the value of the constant "define('MAX_FILE_SIZE', 600000)" in the file "simple_html_dom.php", for example to 60000000. That helped. In my case the page was larger than the specified limit, so loading it was cut off at 600000. Good luck, and experiment.
I found the solution here:
https://www.canbike.org/information-technology/php...
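
To check whether this limit is what you are hitting, it can help to look at the size of the downloaded page before parsing it. In the copies of simple_html_dom.php I have seen, str_get_html() returns false for an empty string or one longer than MAX_FILE_SIZE, but verify this in your own copy. The sketch below uses a hypothetical helper, parseProblem, whose default limit simply mirrors the 600000 mentioned above.

```php
<?php
// Hypothetical helper: explain why a fetched body may fail to parse, or return
// null when it looks fine. $maxSize mirrors the library's MAX_FILE_SIZE value.
function parseProblem($html, $maxSize = 600000)
{
    if ($html === false) {
        return 'request failed';
    }
    if ($html === '') {
        return 'empty response';
    }
    if (strlen($html) > $maxSize) {
        return 'body is ' . strlen($html) . ' bytes, limit is ' . $maxSize;
    }
    return null;
}
```

Logging the returned message for each URL shows whether the failing pages are oversized or simply failed to download.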
