How to collect documents from the site?

Andrey Ermakov, 2019-07-18 10:54:10

JavaScript

I need to solve the following problem:
1. Articles are published every day on the news sites I follow.
2. I need to take the article text from a specific container on the page.
3. The articles must be saved in txt format and labeled with the date and time of collection.
4. The save location is either the local computer or a cloud service (such as mail.ru).
5. Collection should run once a day, or manually (at the press of a button).
How can I collect these documents?

2 answers

bespechnost, 2019-07-18
@bespechnost

Many news sites have an RSS feed. You can request it from Node.js, parse it, and upload the result to the cloud.
If you need to parse the pages themselves, you can use https://github.com/GoogleChrome/puppeteer
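
A minimal sketch of the RSS route, assuming Node 18+ (for the built-in `fetch`). The names `parseRssItems` and `fetchFeed` are made up for illustration, and pulling tags out with a regex is a deliberate simplification: it may be enough for a simple feed, but a real feed would be safer with a proper XML/RSS parser.

```javascript
// Strip a CDATA wrapper if the tag content has one.
function stripCdata(value) {
  const match = value.match(/^<!\[CDATA\[([\s\S]*?)\]\]>$/);
  return (match ? match[1] : value).trim();
}

// Return the text of the first <tag>...</tag> found in xml, or "" if absent.
function readTag(xml, tag) {
  const pattern = new RegExp(`<${tag}(?:\\s[^>]*)?>([\\s\\S]*?)</${tag}>`, "i");
  const match = xml.match(pattern);
  return match ? stripCdata(match[1].trim()) : "";
}

// Turn raw RSS XML into a list of { title, link, description } objects.
function parseRssItems(xml) {
  const items = xml.match(/<item[\s>][\s\S]*?<\/item>/gi) || [];
  return items.map((item) => ({
    title: readTag(item, "title"),
    link: readTag(item, "link"),
    description: readTag(item, "description"),
  }));
}

// Download a feed and parse it (feedUrl is whatever feed the site exposes).
async function fetchFeed(feedUrl) {
  const response = await fetch(feedUrl);
  if (!response.ok) throw new Error(`Feed request failed: ${response.status}`);
  return parseRssItems(await response.text());
}
```

Each item's `link` can then be passed on to whatever fetches and saves the full article.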

hzzzzl, 2019-07-18
@hzzzzl

RSS usually carries only the first couple of lines of an article, precisely so that the full article can't be pulled from the feed without ads :) /h...
https://itnext.io/scraping-with-nodejs-and-cheerio...
request fetches the HTML pages, and with cheerio you can easily pick out the content blocks by CSS selectors.
puppeteer and other "headless browsers" are usually not needed for this.
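
Roughly how that could look, as a sketch rather than a finished tool. Assumptions: Node 18+, so the built-in `fetch` stands in for the `request` package; cheerio is installed separately (`npm install cheerio`); the function names, the selector, and the output folder are all placeholders. The file name carries the local date and time of collection, as the question asks.

```javascript
const fs = require("fs");
const path = require("path");

// Pull the text of one container out of the page with cheerio.
function extractText(html, selector) {
  const cheerio = require("cheerio"); // third-party, loaded only when needed
  const $ = cheerio.load(html);
  return $(selector).first().text().trim();
}

// Build a name like "2019-07-18_10-54-10_some-title.txt" (local time).
function timestampedName(title, date = new Date()) {
  const pad = (n) => String(n).padStart(2, "0");
  const stamp =
    `${date.getFullYear()}-${pad(date.getMonth() + 1)}-${pad(date.getDate())}` +
    `_${pad(date.getHours())}-${pad(date.getMinutes())}-${pad(date.getSeconds())}`;
  // Keep letters and digits, collapse everything else into single dashes.
  const slug = title
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]+/gu, "-")
    .replace(/^-+|-+$/g, "");
  return `${stamp}_${slug || "article"}.txt`;
}

// Write the text into dir (created if missing) and return the file path.
function saveArticle(dir, title, text, date = new Date()) {
  fs.mkdirSync(dir, { recursive: true });
  const file = path.join(dir, timestampedName(title, date));
  fs.writeFileSync(file, text, "utf8");
  return file;
}

// Fetch one page, extract the container text, save it as a txt file.
async function collect(url, selector, dir) {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`Request failed: ${response.status}`);
  const html = await response.text();
  return saveArticle(dir, new URL(url).pathname, extractText(html, selector));
}
```

For the once-a-day run, the script could be started by cron or Windows Task Scheduler, and running it by hand covers the manual case. `dir` can be any local folder, including one that a cloud client keeps in sync.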