Is NodeJS good for web scraping?
Recently at work I often have to scrape sites. Right now I mainly use PHP Simple HTML DOM Parser, but I want something more advanced - for example, the ability to automatically follow pagination links, etc. I'm looking at NodeJS and its modules. Is it worth delving into this topic, or is this ecosystem not much different in capabilities from PHP-based solutions?
A good alternative to PhantomJS is Nightmare: https://github.com/segmentio/nightmare
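A minimal sketch of what that looks like (assuming `npm install nightmare`; the URL is just a placeholder):

    // Open a page in a headless Electron window and read its title.
    const Nightmare = require('nightmare');
    const nightmare = Nightmare({ show: false }); // no visible window

    nightmare
      .goto('https://example.com')
      .evaluate(() => document.title) // runs inside the page context
      .end()
      .then((title) => console.log('Page title:', title))
      .catch((err) => console.error('Scraping failed:', err));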
It's not about the language, it's about the tool.
For example, Scrapy is a great tool. It's written in Python, but I think there are similar ones for PHP and JS (Node), you just need to look for them.
If we are talking about full JS emulation, then yes, only JS will do.
But that's not really NodeJS - it's essentially just JS.
A headless browser in JS gives you full emulation of the browser DOM and full control - PhantomJS, for example. But if you don't need full DOM emulation to parse your site (it isn't some fancy AJAX site), then you don't need PhantomJS.
Then just look for a convenient library for your favorite language...
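For the full-emulation case, a minimal PhantomJS sketch might look like this (run with `phantomjs script.js`, not node; the URL and selector are placeholders):

    // Load a page, let its JS run, then read data from the rendered DOM.
    var page = require('webpage').create();

    page.open('https://example.com', function (status) {
      if (status !== 'success') {
        console.log('Failed to load the page');
        phantom.exit(1);
        return;
      }
      // evaluate() executes inside the page and sees the rendered DOM
      var heading = page.evaluate(function () {
        var el = document.querySelector('h1');
        return el ? el.textContent : null;
      });
      console.log('First heading:', heading);
      phantom.exit();
    });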
When you have a crowbar in your hands, everything seems to be a site.
It's worth taking a look at Golang.
Nobody mentions cheerio.
Is it really that bad? It seemed to me that it's quite lightweight and the workflow is simple (load - dissect - save). The only problem I ran into is with the loop. When you run it over all the pages in a loop, it tries to execute all the iterations at once (the requests are asynchronous). What I haven't figured out yet is how to make the loop continue only after the previous step has completed. If you can, please point me in the right direction!
The idea was to write my own API for sites that don't provide one...
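Regarding the loop question: one common way to make each page wait for the previous one is async/await, so the request for the next page only starts after the current one has been parsed. A minimal sketch (assuming Node 18+ for the built-in fetch and `npm install cheerio`; the URL pattern and selector are made up):

    // Sequential pagination with cheerio: fetch and parse one page at a time.
    const cheerio = require('cheerio');

    async function scrapeAllPages() {
      const results = [];
      for (let page = 1; page <= 10; page++) {
        // await makes the loop body finish before the next iteration starts
        const response = await fetch(`https://example.com/catalog?page=${page}`);
        const html = await response.text();

        const $ = cheerio.load(html);          // the "dissect" step
        $('.item-title').each((_, el) => {
          results.push($(el).text().trim());   // the "save" step
        });
      }
      return results;
    }

    scrapeAllPages()
      .then((items) => console.log('Collected', items.length, 'items'))
      .catch((err) => console.error('Scraping failed:', err));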
Take a look at ANTLR. The tool itself is written in Java, but there is a JS target.