PHP
GM2mars, 2014-07-23 23:40:05

How can I improve a PHP function that gets the title of a remote page?

I have the following script:

function titleLink($url) {
    // if the domain is Cyrillic, convert it to Punycode
    if (preg_match('/[а-яА-Я]/', $url)) {
      require_once('modules/idna_convert.class.php');
      $convert=new idna_convert();
      $url=$convert->encode($url);
    }
    $title="";
    // fetch the remote page
    @$page=file_get_contents($url); 
    if ($page) {
      // find and extract the title
      if (eregi("<title>(.*)</title>", $page, $out)) {
        $title=$out[1];
        // check the encoding; if it is Windows-1251, convert to UTF-8
        if (mb_check_encoding($title, 'Windows-1251') && !mb_check_encoding($title, 'UTF-8')) {
          $title=iconv("CP1251//IGNORE", "UTF-8", $title);
        }
      }
    }
    return $title;
  }

I get a result for roughly 75% of requests. The title sometimes cannot be obtained even from perfectly ordinary pages: on one site, for example, I got the title for the second and third pages but not for the fourth.
How can the script be improved so that it extracts the title more reliably?

1 answer
IceJOKER, 2014-07-23
@GM2mars

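You use mb_ in one place and iconv in another; why not stick with mb_ throughout?
You use preg_match() in one place and eregi in another; you must be kidding :D
mb_convert_encoding($title, 'UTF-8', 'Windows-1251'); // pass the source encoding explicitly; if it is omitted, the internal encoding is assumed, not detected
preg_match('~<title>(.*?)</title>~iu', $page, $out); // i: case-insensitive search, u: treat pattern and subject as UTF-8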
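
To make the two calls above concrete, here is a minimal sketch that runs them on an in-memory page instead of a remote one. The helper name `extractTitle` is hypothetical and not part of the original answer. It assumes the `mbstring` extension is loaded and that a page is either UTF-8 or Windows-1251, the two encodings mentioned in the question. It also drops the `u` modifier for the title search, because `preg_match()` fails on a subject that is not valid UTF-8 when `u` is set.

```php
<?php
// Hypothetical helper: pull the <title> text out of raw HTML and return it as UTF-8.
function extractTitle(string $html): string
{
    // "i" = case-insensitive, "s" = let "." match newlines inside the title.
    // No "u" here: the raw page may not be valid UTF-8 yet.
    if (!preg_match('~<title[^>]*>(.*?)</title>~is', $html, $out)) {
        return '';
    }
    $title = trim($out[1]);
    // Assumption: anything that is not valid UTF-8 is Windows-1251.
    if (!mb_check_encoding($title, 'UTF-8')) {
        $title = mb_convert_encoding($title, 'UTF-8', 'Windows-1251');
    }
    // Decode entities such as &amp; and collapse runs of whitespace.
    $title = html_entity_decode($title, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return preg_replace('~\s+~u', ' ', $title);
}

// Sample page encoded as Windows-1251 (built from a UTF-8 literal).
$page = mb_convert_encoding(
    "<html><head><TITLE>\n Привет &amp; мир </TITLE></head></html>",
    'Windows-1251',
    'UTF-8'
);
echo extractTitle($page), "\n";
```

Keeping the extraction separate from the download makes it possible to test the regex and the encoding logic on saved pages that currently fail.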

<?php
function getTitle($url) {
    if (!$url) return '';
    // NOTE: only the host is kept, so the path is dropped and the site's front page is requested
    $url = 'http://'.parse_url($url, PHP_URL_HOST);
    // if the domain is Cyrillic, convert it to Punycode
    if (preg_match('/[а-яё]/iu', $url)) {
      require_once('modules/idna_convert.class.php');
      $convert=new idna_convert();
      $url=$convert->encode($url);
    }
    $title="";
    // fetch the remote page
    $page=@file_get_contents($url);
    if ($page) {
      // find and extract the title; no "u" modifier, because preg_match() fails on a page that is not valid UTF-8
      if (preg_match("~<title[^>]*>(.*?)</title>~is", $page, $out)) {
        $title=trim($out[1]);
        // convert to UTF-8 (assumes a non-UTF-8 page is Windows-1251)
        if (!mb_check_encoding($title, 'UTF-8')) {
          $title=mb_convert_encoding($title, 'UTF-8', 'Windows-1251');
        }
      }
    }
    return $title;
}
echo getTitle('http://toster.ru/q/траляля');

?>
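
The thread does not establish why roughly a quarter of the requests return nothing. One possibility, which is only an assumption here, is that the download itself fails (no User-Agent header, a slow server, a redirect) and the `@` operator hides the warning. The sketch below shows one way to make that step observable. The helper name `fetchPage`, the User-Agent string, and the timeout value are all illustrative, not taken from the original code.

```php
<?php
// Hypothetical helper: fetch a page body, or return null on failure.
function fetchPage(string $url, int $timeout = 10): ?string
{
    // The "http" context options also apply to https:// URLs.
    $context = stream_context_create([
        'http' => [
            // Some servers refuse requests that carry no User-Agent header.
            'user_agent'      => 'Mozilla/5.0 (compatible; TitleFetcher/1.0)',
            'timeout'         => $timeout, // seconds to wait for data
            'follow_location' => 1,        // follow HTTP redirects
            'max_redirects'   => 5,
        ],
    ]);
    $page = @file_get_contents($url, false, $context);
    if ($page === false) {
        // Log the reason instead of silently returning an empty title.
        $error = error_get_last();
        error_log('fetchPage failed for ' . $url . ': ' . ($error['message'] ?? 'unknown error'));
        return null;
    }
    return $page;
}
```

Logging the failures for the pages that return no title should show whether the problem is in the download or in the title extraction.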
