PHP
lemonlimelike, 2018-10-14 22:20:17

Why won't Kinopoisk parse?

Hello! I wrote a parser. First I get a JSON response that contains each movie's Kinopoisk id. From that id I build a link to the Kinopoisk page, then fetch the data for that movie. There are about 15k such ids, so the parser has to visit 15k Kinopoisk links and take the necessary data from each. But only the first 2 movies are parsed; then I get an error that the `$img` variable is empty. If I run the script again right away, the error that `$img` is empty occurs immediately, on the first element of the array. What should I do? Is there an alternative to Kinopoisk?
Here is the script:

$url = file_get_contents('url');
$json = json_decode($url, true);
echo 'Running: '. PHP_EOL;
$start = microtime(true);
foreach ($json as $k => $value) {
    // echo $k. '--';
    // print_r($value);
    echo "Opening Kinopoisk by id: " . $value['kinopoisk_id'] . PHP_EOL;
    $newUrl = new Crawler(file_get_contents('https://www.kinopoisk.ru/film/' . $value['kinopoisk_id']));

    echo 'Waiting 5 sec (one)'.PHP_EOL;
    sleep(5);

    $img = $newUrl->filter('.popupBigImage img')->attr('src');

    echo 'Waiting 5 sec (two)'.PHP_EOL;
    sleep(5);

    $rating = $newUrl->filter('.rating_ball')->text();

    echo 'Waiting 5 sec (three)'.PHP_EOL;
    sleep(5);

    $films = Films::where('name',$value['name'])->first();
    if($films == null){
        $film = new Films();
        $film->name = $value['name'];
        $film->translate = 'Нормальный'; // stored value, means "Normal"
        $film->url = $value['url'];
        $film->year = $value['year'];
        $film->img = $img;
        $film->rating = $rating;
        $film->kinopoisk_id = $value['kinopoisk_id'];
        $film->save();
    }
}

4 answer(s)
DanKud, 2018-10-14
@lemonlimelike

First, of course, check what response comes back from the request to the Kinopoisk page on the loop iterations where you get the error. Most likely a captcha appears there.
Second, don't fetch the page with the primitive file_get_contents; use a full-fledged library, for example the standard cURL extension. Send full HTTP headers that mimic a normal user's browser. Given the specifics of Kinopoisk and the fact that everyone tries to parse it, basic blocking such as rejecting requests with empty headers is almost certainly in place there.
And if the captcha still appears after that, then write a captcha bypass script using the appropriate services.
One more point, for the future, about parsing with the Symfony DomCrawler library. Don't load the source by passing it straight to the new Crawler() constructor; call the ->addHtmlContent() method afterwards instead, to avoid encoding problems:

$newUrl = new Crawler();
$newUrl->addHtmlContent(file_get_contents('https://www.kinopoisk.ru/film/' . $value['kinopoisk_id']));
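
The cURL suggestion above could look roughly like this. It is only a sketch: fetchPage() and browserHeaders() are hypothetical helper names, and the header values are illustrative assumptions, not something Kinopoisk is known to require or accept.

```php
<?php

/**
 * Browser-like request headers. The values are illustrative assumptions,
 * not headers that Kinopoisk is known to require.
 */
function browserHeaders(): array
{
    return [
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: ru-RU,ru;q=0.9,en;q=0.8',
    ];
}

/**
 * Fetch a page with cURL and return the HTTP status code and the body.
 * fetchPage() is a hypothetical helper, not part of any library.
 */
function fetchPage(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,             // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,             // follow redirects
        CURLOPT_HTTPHEADER     => browserHeaders(), // send the browser-like headers
        CURLOPT_ENCODING       => '',               // accept any encoding cURL supports
        CURLOPT_TIMEOUT        => 30,               // give up after 30 seconds
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);

    // curl_exec() returns false on failure; normalize that to an empty body.
    return ['status' => $status, 'body' => $body === false ? '' : $body];
}
```

In the loop, the result of fetchPage('https://www.kinopoisk.ru/film/' . $value['kinopoisk_id']) would replace the file_get_contents call, with its 'body' element passed to ->addHtmlContent().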

Ilya Malinovsky, 2018-10-14
@iliya936

Try making a plain file_get_contents request and printing the response. Most likely you will see either a captcha page or an access-denied message. I think a giant like Kinopoisk takes care that its site is not parsed.
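
A rough way to automate that check is sketched below. The looksBlocked() helper is a hypothetical name, and the 'captcha' marker is only a guess at what a block page might contain, so print a real response first and adjust the marker to match.

```php
<?php

/**
 * Guess whether a response is a block page rather than a film page.
 * Hypothetical helper: the 'captcha' marker is an assumption about
 * what the blocked response contains.
 */
function looksBlocked(int $status, string $html): bool
{
    // A non-200 status or an empty body is never a usable film page.
    if ($status !== 200 || trim($html) === '') {
        return true;
    }

    // Case-insensitive search for the assumed captcha marker.
    return stripos($html, 'captcha') !== false;
}
```

If this returns true for a response, it makes sense to stop the loop and print the first few hundred characters of the body, rather than continue and fail on the empty `$img`.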

Alex-1917, 2018-10-15
@alex-1917

The data you are trying to parse is periodically published by certain people as complete archives...
Make friends with Google)))
And please don't write file_get_contents anymore, I'm allergic to file_get_contents....

Vlad Osadchyi, 2018-10-15
@VladOsadchyi

Here Kinopoisk is parsed in Python.
