Parsing
mit5x, 2019-04-04 10:21:46

How to parse news sites?

Hello!
Management has given me the task of parsing news sites and saving the latest news to text files.
Several sites need to be parsed, with the news from each site stored in its own folder and further sorted into subfolders by date.
The tool is meant to be a web service started manually or by cron.
A Windows application is also an option.
Where should I start? There must be libraries that make this kind of development easier.
The crawler needs a pause between requests so that it does not get blocked for sending too many of them.
News that has already been downloaded should not be downloaded again.
HTML tags have to be stripped, because only the text is needed.
And so on.
Clearly, many people have gone down this path before me.
Thank you.
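
The bookkeeping requirements listed in the question (one folder per site, subfolders by date, no repeated downloads, a pause between requests) could be sketched in Python roughly as below. This is only an illustration using the standard library: the names `article_path`, `save_if_new` and `Throttle`, the hashed file names, and the folder layout `root/<site>/<YYYY-MM-DD>/` are all assumptions made for the example, not something specified in the thread. Fetching the pages themselves is left out.

```python
import hashlib
import time
from datetime import date
from pathlib import Path


def article_path(root: Path, site: str, published: date, url: str) -> Path:
    """Build root/<site>/<YYYY-MM-DD>/<hash of url>.txt for one news item."""
    # Hashing the URL gives a stable, filesystem-safe file name (an assumption
    # of this sketch; any unique key per article would do).
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".txt"
    return root / site / published.isoformat() / name


def save_if_new(root: Path, site: str, published: date, url: str, text: str) -> bool:
    """Write the text once; return False if this URL was already saved."""
    path = article_path(root, site, published, url)
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return True


class Throttle:
    """Make consecutive calls to wait() at least `delay` seconds apart."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self._last = None

    def wait(self) -> None:
        now = time.monotonic()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

In a crawl loop one would call `Throttle.wait()` before each request and `save_if_new(...)` after extracting the text; checking `article_path(...).exists()` before the request avoids fetching a known article at all.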


2 answers
Jan, 2019-04-04
@on1k

There are different ways to parse sites.
Without programming, you can use a tool such as ZennoPoster.
With programming, it depends on which language you know. As far as I know, the most popular languages for parsing are Python, PHP and C#. Each has its own packages for parsing or for emulating a browser (Selenium for emulation; packages that parse the page DOM and pull out text using CSS selectors or XPath).
Unfortunately, the question gives too little detail for a more specific answer.
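
As a small illustration of pulling text out of a page in Python, here is a sketch that uses only the standard library's `html.parser`. The names `TextExtractor` and `html_to_text` are made up for this example. It strips all tags from whatever HTML it is given; isolating just the article body on a real news site would most likely still need site-specific CSS selectors or XPath from a third-party package.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._parts)


def html_to_text(html: str) -> str:
    """Return the text content of an HTML string, one text run per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Note that every run of text between tags ends up on its own line, so inline markup such as `<b>` splits a sentence across lines; joining with spaces instead is an easy change if that matters.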

Kudis, 2019-04-04
@kudis

If it only needs to run by hand or on a schedule:
You can write a simple extension for Chrome, which is about as cross-platform as it gets.
You can run it manually or from any scheduler.
Send me a message and I'll help you get started.
