JavaScript
Alexey Mairin, 2018-03-18 14:05:48

Why are errors thrown when parsing a large number of pages?

I'm trying to parse a site with a large number of pages.
It works up to a certain point, then throws a bunch of errors.
Here are the errors:

E:\Study\SPOcoursework\src\index.js:12
var $ = cheerio.load(ress.body);
^

TypeError: Cannot read property 'body' of undefined
at E:\Study\SPOcoursework\src\index.js:12:42
at done (E:\Study\SPOcoursework\src\node_modules\needle\lib\needle.js:440:14)
at ClientRequest.had_error (E:\Study\SPOcoursework\src\node_modules\needle\lib\needle.js:450:5)
at emitOne (events.js:116:13)
at ClientRequest.emit (events.js:211:7)
at TLSSocket.socketCloseListener (_http_client.js:363:9)
at emitOne (events.js:121:20)
at TLSSocket.emit (events.js:211:7)
at _handle.close (net.js:554:12)
at TCP.done [as _onclose] (_tls_wrap.js:356:7)
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! [email protected] start: `node index.js`
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the [email protected] start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR! C:\Users\USER\AppData\Roaming\npm-cache\_logs\2018-03-18T10_56_46_511Z-debug.log


And also the code:
var needle = require('needle');
var cheerio = require('cheerio');
var URL = 'https://www.menu.by/minsk/delivery/home.html';

needle.get(URL, function(err, res) {
    var $ = cheerio.load(res.body);
    $('a.title').each(function (i, element) {
        var a = $(this);
        var url = "https://www.menu.by" + a.attr('href');
        var title = a.text();

        needle.get(url, function (error, ress) {
            var $ = cheerio.load(ress.body);
            $('div.prod-content').each(function (i, element) {
                var c = $(this);
                var name = c.text();

                var json = {
                    name: title,
                    url: url,
                    subName: name
                };
                console.log(json);
            });
        });
    });
});


Here is a brief summary of what the code does, so you don't have to dig through it:

1) From the main page I collect links to other pages (which I need to visit to get information).
2) I run the same code as in the first pass, but for the subpages.
3) The data is printed until some n-th point (as I understand it, there are too many requests).
4) The links are processed in random order for some reason (although if I print the URLs, all the links from the main page are there).

So how do I deal with this?


2 answer(s)
Alexey Yarkov, 2018-03-18
@yarkov

In both callbacks, the FIRST argument is the error object, or null if everything is fine.
When it is set, the second argument is undefined, which is why reading its body property throws.
So start handling that case and everything will work. Right now your code simply assumes that errors can never happen.
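
A minimal sketch of such a check, assuming the error-first callback shape used in the question. The helper name `bodyOrNull` is hypothetical and is not part of needle or cheerio; the needle and cheerio calls appear only in comments to show where the helper would go.

```javascript
// Hypothetical helper: returns the response body, or null when the request failed.
// In an error-first callback the response is undefined whenever an error is set,
// which is what produces "Cannot read property 'body' of undefined".
function bodyOrNull(err, res, url) {
  if (err || !res) {
    console.error('Request failed for ' + url + ':', err ? err.message : 'no response');
    return null;
  }
  return res.body;
}

// Intended use inside the question's callbacks (not executed here):
//
//   needle.get(url, function (error, ress) {
//     var body = bodyOrNull(error, ress, url);
//     if (body === null) return; // skip this page instead of crashing
//     var $ = cheerio.load(body);
//     // ... same parsing as before ...
//   });
```

With this in place a failed page is logged and skipped, so a single dropped connection no longer ends the whole run.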

sim3x, 2018-03-18
@sim3x

The inner callback receives this error:

{ Error: read ECONNRESET
    at exports._errnoException (util.js:1020:11)
    at TLSWrap.onread (net.js:580:26) code: 'ECONNRESET', errno: 'ECONNRESET', syscall: 'read' }

I increased the timeouts and set stream_length to 0, following advice I found:
var needle = require('needle');
var cheerio = require('cheerio');

var URL = 'https://www.menu.by/minsk/delivery/home.html';

var needle_params =  {
  open_timeout: 60000,
  read_timeout: 60000,
  compressed: true,
  stream_length: 0
}

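needle.get(URL, needle_params, function(err, res) {
  if (err || !res) {
    // Without this check, a failed request crashes on res.body
    console.error('Failed to load the index page:', err);
    return;
  }
  var $ = cheerio.load(res.body);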
  $('a.title').each(function(i, element) {
    var a = $(this);
    var url = "https://www.menu.by" + a.attr('href');
    var title = a.text();

    needle.get(url, needle_params, function(error, response) {

      if (error) {
        needle.get(url, needle_params, function(error, response) { 
          // I never figured out why it works on the second try
          console.log(
            url,
            response ? response.statusCode : 'response is bad bro')
        });
      }
    });
  });
});
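
The single retry above can be generalized. The sketch below rests on an assumption that this thread does not confirm: that the resets come from firing every sub-request at once. It retries a bounded number of times and visits the links one at a time, which also keeps the output in page order, since concurrent callbacks otherwise finish in whatever order the responses arrive. The names `getWithRetry` and `eachInSeries` are hypothetical helpers, and `getFn` stands for any error-first getter, for example a thin wrapper around `needle.get`.

```javascript
// Hypothetical helper: call getFn(url, cb) and retry on error,
// at most `retriesLeft` more times, then report the final outcome.
function getWithRetry(getFn, url, retriesLeft, callback) {
  getFn(url, function (error, response) {
    if (error && retriesLeft > 0) {
      return getWithRetry(getFn, url, retriesLeft - 1, callback);
    }
    callback(error, response);
  });
}

// Hypothetical helper: run worker(item, next) for each item, one at a time,
// starting the next item only after the previous one has called next().
function eachInSeries(items, worker, done) {
  var index = 0;
  function next() {
    if (index >= items.length) return done();
    worker(items[index++], next);
  }
  next();
}

// Intended use with the code above (not executed here):
//
//   function get(url, cb) { needle.get(url, needle_params, cb); }
//
//   eachInSeries(urls, function (url, next) {
//     getWithRetry(get, url, 2, function (error, response) {
//       console.log(url, error ? error.code : response.statusCode);
//       next();
//     });
//   }, function () { console.log('all pages processed'); });
```

Here `urls` would be the array of links collected from the index page. Running strictly in series is the slowest option; if it turns out to be too slow, a small fixed number of parallel workers is the usual middle ground.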
