Is NodeJS good for web scraping?

Crash, 2016-05-13 10:07:04

HTML

Lately I often have to parse websites at work. Right now I mainly use PHP Simple HTML DOM Parser, but I want something more advanced, for example the ability to automatically follow pagination links. I'm looking at NodeJS and its modules. Is it worth digging into, or is this ecosystem not much different in capabilities from PHP-based solutions?

9 answer(s)

Maxim, 2016-05-13
@maxfarseer

Great. As mentioned above, PhantomJS is very cool!

Muhammad, 2016-05-13
@muhammad_97

A good alternative to PhantomJS is Nightmare: https://github.com/segmentio/nightmare

spotifi, 2016-05-13
@spotifi

It's not about the language, it's about the tool.
For example, Scrapy is a great tool, and it's written in Python. I think there are equivalents for PHP and JS (Node) too; you just need to look for them.
If we are talking about full JS emulation, then yes, only JS will do.
But that isn't NodeJS; it's essentially plain JS.
A headless browser driven from JS gives you full emulation of the browser DOM and full control over it. PhantomJS is one example. But if you don't need full DOM emulation to parse your site (that is, the site isn't a fancy AJAX site), then you don't need PhantomJS.
In that case, just look for a convenient library in your favorite language...

bromzh, 2016-05-13
@bromzh

https://ru.wikibooks.org/wiki/Grab

sim3x, 2016-05-13
@sim3x

When you have a crowbar in your hands, everything looks like a site.
Golang is worth a look.

Sstrax, 2016-05-19
@Sstrax

Nobody mentions cheerio.
Is it really that bad? It seemed fairly lightweight to me, and the algorithm is simple (load, dissect, save). The only problem I ran into was with the loop. When I run it over all the pages in a loop, it tries to execute every iteration at once (concurrently). I haven't figured out yet how to make the loop continue only after the previous step has finished. If you can, please point me in the right direction!
The idea was to write my own API for sites that don't provide one...
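
One possible way to make the loop wait for each page, as a minimal sketch rather than a definitive answer. It assumes a Node version with async/await support. `fetchPage` and `scrapeSequentially` are hypothetical names: `fetchPage` only simulates network latency here and stands in for the real download-and-parse step (for example an HTTP request followed by `cheerio.load`). A plain `for...of` loop with `await` runs the iterations one at a time, whereas `forEach` with an async callback would start them all at once.

```javascript
// Hypothetical stand-in for "download the page and parse it with cheerio".
// It only simulates latency so that the example runs on its own.
function fetchPage(pageNumber) {
  return new Promise((resolve) => {
    // Later pages finish faster, so out-of-order execution would be visible.
    const delayMs = 30 - pageNumber * 5;
    setTimeout(
      () => resolve({ page: pageNumber, title: `Page ${pageNumber}` }),
      delayMs
    );
  });
}

// Processes the pages strictly one after another.
async function scrapeSequentially(pageNumbers, fetchFn) {
  const results = [];
  for (const pageNumber of pageNumbers) {
    // await pauses the loop here, so the next request starts
    // only after the previous one has finished.
    const result = await fetchFn(pageNumber);
    results.push(result);
  }
  return results;
}

scrapeSequentially([1, 2, 3], fetchPage).then((results) => {
  console.log(results.map((r) => r.title).join(", "));
});
```

Passing the fetch function in as a parameter keeps the loop independent of the HTTP library, so the simulated `fetchPage` can later be swapped for a real request.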

Максим Кудрявцев, 2016-05-13
@kumaxim

Take a look at ANTLR. The tool itself is written in Java, but there is a JS target.

hOtRush, 2016-12-08
@hOtRush

I would recommend Python or Go. If you parse with Node, and with PhantomJS on top of that, you will need a ton of hardware resources.