Bogdan, 2014-12-27 01:28:37

How to parse a website using QNetworkAccessManager?

I need to find a fragment like this on the page and extract the link to the picture:

<div class="row-fluid">
            <strong>
                Скачать оригинал:
                <a href="pictures/originals/2014/Nature_Highway_in_the_mountains_082434_.jpg" class="original-link" download="pictures/originals/2014/Nature_Highway_in_the_mountains_082434_.jpg" title="Шоссе в сторону гор">Шоссе в сторону гор - 1920x1080</a>
            </strong>
        </div>

As a first step, I tried simply downloading the HTML page:
void DownloadHtml::Download()
{
    manager = new QNetworkAccessManager(this);

    connect(manager, SIGNAL(finished(QNetworkReply*)),
            this, SLOT(replyFinished(QNetworkReply*)));

    manager->get(QNetworkRequest(QUrl("http://google.com")));
}
void DownloadHtml::replyFinished (QNetworkReply *reply)
{
    if(reply->error())
    {
        qDebug() << "Error!";
        qDebug() << reply->errorString();
    }
    else
    {
        QFile *file = new QFile("C:/wall/downloaded.txt");
        if(file->open(QFile::Append))
        {
            file->write(reply->readAll());
            file->flush();
            file->close();
        }
        delete file;
    }

    reply->deleteLater();
}

But the result was:
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="http://www.google.ru/?gfe_rd=cr&amp;ei=3cudVPqyMYTzwAOP5oCQCg">here</A>.
</BODY></HTML>

1 answer(s)
Sergey Lagner, 2014-12-27
@threadbrain

Well, that part of your code is correct. If you make the same request with curl, you get exactly the same thing:

@home-tower:~$ curl -i http://google.com
HTTP/1.1 302 Found
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Location: http://www.google.ru/?gfe_rd=cr&ei=EPmeVPyRK6Or8wf5-IDABA
Content-Length: 258
Date: Sat, 27 Dec 2014 18:23:12 GMT
Server: GFE/2.0
Alternate-Protocol: 80:quic,p=0.002

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="http://www.google.ru/?gfe_rd=cr&amp;ei=EPmeVPyRK6Or8wf5-IDABA">here</A>.
</BODY></HTML>

If you replace google.com with a URL that does not return a redirect, you will get the HTML you want.
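To make the redirect concrete: a 302 response carries the new address in its Location header, and the client is expected to repeat the request there. Below is a minimal sketch in plain C++11 (no Qt) that pulls that header out of a raw response like the curl output above. The function name redirectTarget is made up for illustration. In Qt itself you would not parse headers by hand; as far as I know, the target is available from the reply via QNetworkRequest::RedirectionTargetAttribute, and you then issue a second get() with it.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical helper (not part of Qt): returns the value of the Location
// header from a raw HTTP response, or an empty string if there is none.
std::string redirectTarget(const std::string& response)
{
    std::istringstream in(response);
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r')
            line.pop_back();                 // tolerate CRLF line endings
        if (line.empty())
            break;                           // blank line: headers are over
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos)
            continue;                        // status line or malformed header
        std::string name = line.substr(0, colon);
        // Header names are case-insensitive, so compare in lower case.
        std::transform(name.begin(), name.end(), name.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        if (name != "location")
            continue;
        // Skip optional whitespace after the colon.
        const std::string::size_type start = line.find_first_not_of(" \t", colon + 1);
        return start == std::string::npos ? std::string() : line.substr(start);
    }
    return std::string();
}
```

Feeding it the curl response above should give back the http://www.google.ru/... address, and a plain 200 response gives an empty string, which means there is nothing to follow.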
After that it gets more interesting, because this HTML needs to be parsed. I ran into this problem not long ago.
The problem is that Qt has no lightweight HTML parser (apart from the monstrous QtWebKit, that is). There are XML parsers, but they choke on many real-world pages. That's why I used Gumbo, Google's HTML parser implementation. I wrapped it for my project to make it more Qt-like. The result is on GitHub: https://github.com/lagner/QGumboParser. Take a look at the tests: they serve as usage examples in place of documentation.
I hope this helps.
UPD: step by step:
1. Create a Subdirs Project in Qt Creator.
2. Add a new subproject and select the application type you need, for example Qt Console Application.
3. In the console, go to the project folder and run:
~/Projects/htmlparsing$ git init
~/Projects/htmlparsing$ git submodule add https://github.com/lagner/QGumboParser.git lib
  Cloning into 'QGumboParser'...
  remote: Counting objects: 96, done.
  remote: Total 96 (delta 0), reused 0 (delta 0)
  Unpacking objects: 100% (96/96), done.
  Checking connectivity... done.

~/Projects/htmlparsing$ git submodule update --init --recursive

4. Back in the IDE, add this line to the root .pro file: SUBDIRS += lib/QGumboParser
5. For the application subproject, use Add Library -> Internal Library. Everything you need will already be selected there. You also need to add CONFIG += c++11 to the application's .pro file.
6. Open main.cpp and write:
#include <QCoreApplication>
#include <QDebug>
#include <QNetworkAccessManager>
#include <QNetworkRequest>
#include <QNetworkReply>
#include <qgumbodocument.h>
#include <qgumbonode.h>


void requestFinished(QNetworkReply*);
void parseHtml(QString html);


int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    QNetworkAccessManager nm;
    QObject::connect(&nm, &QNetworkAccessManager::finished, requestFinished);

    nm.get(QNetworkRequest(QStringLiteral("http://toster.ru/q/168437")));

    return a.exec();
}


void requestFinished(QNetworkReply* rep) {
    if (rep->error() == QNetworkReply::NoError) {
        QByteArray rawdata = rep->readAll();
        QString html = QString::fromUtf8(rawdata);

        parseHtml(html);

    } else {
        qDebug() << "request failed: " << rep->errorString();
    }

    rep->deleteLater();
    QCoreApplication::quit();
}


void parseHtml(QString html) {
    try {
        QGumboDocument doc = QGumboDocument::parse(html);
        QGumboNode root = doc.rootNode();

        auto nodes = root.getElementsByTagName(HtmlTag::TITLE);
        for (auto& node: nodes) {
            qDebug() << "title: " << node.innerText();
        }

    } catch (...) {
        qCritical() << "smth wrong";
    }
}

Everything should work.
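The example above only prints the page title. For the fragment from the question, the value you are after is the href of the <a> tag with class="original-link". I am not reproducing the QGumboParser calls for that here (check its tests), but as a rough illustration of the target, here is a sketch in plain C++11 using std::regex. The function name originalLinkHref is hypothetical, and it assumes double-quoted attributes and a class attribute that is exactly "original-link". It is a stopgap for one known fragment, not a replacement for a real HTML parser.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: finds the first opening <a ...> tag whose class
// attribute is exactly "original-link" and returns its href value.
// Returns an empty string when no such tag is found.
std::string originalLinkHref(const std::string& html)
{
    // One opening <a ...> tag at a time.
    static const std::regex anchorTag(R"re(<a\b[^>]*>)re", std::regex::icase);
    static const std::regex classAttr(R"re(\bclass\s*=\s*"original-link")re", std::regex::icase);
    static const std::regex hrefAttr(R"re(\bhref\s*=\s*"([^"]*)")re", std::regex::icase);

    for (std::sregex_iterator it(html.begin(), html.end(), anchorTag), end; it != end; ++it) {
        const std::string tag = it->str();
        std::smatch m;
        // The tag must carry the class, and we take its first href attribute.
        if (std::regex_search(tag, classAttr) && std::regex_search(tag, m, hrefAttr))
            return m[1].str();
    }
    return std::string();
}
```

Note that the href in the question is relative (pictures/originals/...), so before downloading the picture it has to be resolved against the URL of the page it came from.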