PHP
dani1a, 2014-10-12 20:47:22

simple_html_dom does not work: searching for tags returns nothing. What is the problem?

Here is my test code:

require_once ('simple_html_dom.php');
$html=file_get_html('http://ya.ru'); 
$ret = $html->find('.content a');
echo $ret[0]->href;

But it outputs nothing, and $ret comes back as an empty array. The result is the same with $html->find('a'). If, instead of fetching a page, I pass HTML text in from a variable, the class finds only the first link and nothing more. With fetched pages it does not work at all. Meanwhile, $html does receive an object, and file_get_contents, which the class uses internally, works on the server. What else could be the problem?


2 answer(s)
dani1a, 2014-10-13
@dani1a

For some reason, your version outputs only what is inside <head></head>, whichever site I substitute.
The issue is resolved: simple_html_dom requires mbstring.func_overload to be set to 0.
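
For anyone hitting the same symptom, here is a minimal sketch of how the setting could be checked before parsing. It assumes the cause is the one described above: with overloading enabled, functions such as strlen and substr are replaced by their multibyte versions, which can shift the offsets a byte-oriented parser relies on. The helper name overload_is_disabled is hypothetical and not part of simple_html_dom.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): interprets the raw
// value returned by ini_get('mbstring.func_overload').
function overload_is_disabled($raw): bool
{
    // ini_get() returns false when the directive does not exist, for
    // example on PHP 8.0+ where it was removed, or without mbstring.
    if ($raw === false) {
        return true;
    }
    // Any non-zero bitmask means some functions are overloaded.
    return (int) $raw === 0;
}

if (!overload_is_disabled(ini_get('mbstring.func_overload'))) {
    echo "mbstring.func_overload is enabled; set it to 0\n";
}
```

As far as I know, the directive has to be changed in php.ini or the server configuration rather than at runtime, so the check can only report the problem, not fix it.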

Sharov Dmitry, 2014-10-13
@vlom88

To begin with, add a check to the script that the page is actually available. Add this function:

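function get_http_response_code($url) {
    // get_headers() returns false when the request fails.
    $headers = get_headers($url);
    if ($headers === false) {
        return false;
    }
    sleep(2); // wait 2 seconds before returning
    // "HTTP/1.1 200 OK": the status code is the 3 characters at offset 9.
    return substr($headers[0], 9, 3);
}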
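
The fixed offset in substr works for status lines of the form "HTTP/1.1 200 OK", where the code starts at character 9. A slightly more tolerant variant, sketched below, uses a regular expression so that a shorter protocol token such as "HTTP/2" is also handled. The function name status_code_from_line is hypothetical, and this is only one possible way to do it.

```php
<?php
// Hypothetical helper: extracts the numeric status code from an HTTP
// status line, or returns null if the line does not look like one.
function status_code_from_line(string $statusLine): ?int
{
    // Protocol token ("HTTP/1.1", "HTTP/2"), whitespace, three digits.
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return null;
}

var_dump(status_code_from_line('HTTP/1.1 200 OK'));
```

It could be applied to the first element of the array returned by get_headers().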

Then rewrite file_get_html as follows:
function file_get_html($url, $use_include_path = false, $context = null, $offset = -1, $maxLen = -1, $lowercase = true, $forceTagsClosed = true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN = true, $defaultBRText = DEFAULT_BR_TEXT, $defaultSpanText = DEFAULT_SPAN_TEXT) {
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    // For sourceforge users: uncomment the next line and comment the retrieve_url_contents line 2 lines down if it is not already done.
    // Status codes for which the page is not fetched at all.
    $errorCodes = ['404', '301', '302', '502'];
    $response = get_http_response_code($url);
    if (!in_array($response, $errorCodes)) {
        $contents = file_get_contents($url, $use_include_path, $context, $offset);
    } else {
        return false;
    }
    
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
    //$contents = retrieve_url_contents($url);
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE) {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}

And to start with, just check whether the script receives the page at all:
require_once ('simple_html_dom.php');
$html=file_get_html('http://ya.ru'); 
echo $html;
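
If the page does arrive but find() still returns nothing, it may help to separate a fetch problem from a parsing problem. The sketch below counts links with PHP's built-in DOM extension instead of simple_html_dom, purely as a cross-check; the helper name count_links is hypothetical, and it assumes the DOM extension is enabled.

```php
<?php
// Hypothetical helper for cross-checking; it uses PHP's DOM extension,
// not simple_html_dom.
function count_links(string $html): int
{
    if ($html === '') {
        return 0; // nothing was fetched
    }
    // Real-world markup is rarely valid, so collect libxml warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    // Number of <a> elements anywhere in the document.
    return $doc->getElementsByTagName('a')->length;
}

echo count_links('<p><a href="/one">one</a> <a href="/two">two</a></p>');
```

Calling count_links((string) file_get_contents('http://ya.ru')) and getting a non-zero number while find('a') stays empty would suggest the problem is in the parser or its environment rather than in the download.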
