How to parse news sites?
Hello!
Management has asked me to parse news sites and save the latest news to text files.
Several sites need to be parsed, with the news from each site placed in its own folder and also sorted by date into subfolders.
It is supposed to be a web service, launched manually or by cron.
A Windows application is also an option.
Where do I begin? There must be libraries that make this kind of development easier.
The crawler needs a pause between requests, so that it does not get blocked for sending too many of them.
News that has already been downloaded should not be downloaded again.
HTML tags must be stripped, because only the text is needed.
etc.
Clearly, many people have gone down this path before me.
Thank you.
You can parse in different ways.
Without programming, there are tools such as ZennoPoster.
With programming, it depends on which language you know. As far as I know, the most popular languages for parsing are Python, PHP, and C#. Each has its own packages for parsing or for browser emulation (Selenium for emulation; packages that parse a page's DOM with CSS selectors or XPath to pull out the text).
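For the "only text is needed" part, here is a minimal Python sketch that uses just the standard library's html.parser. The names TextExtractor and html_to_text are made up for this example, and skipping script and style contents is my own assumption about what counts as "text". A real news page would probably still need a CSS selector or XPath to isolate the article body first.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling html_to_text(page_source) returns the text with all tags removed, one text fragment per line.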
Unfortunately, too little information was given for a more specific answer.
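Still, the general requirements from the question (a pause between requests, no repeated downloads, a folder per site and per date) can be sketched in Python with the standard library alone. Everything here is an assumption made for illustration: the function names, the folder layout root/site/YYYY-MM-DD/, the file name derived from a hash of the URL, the 2 second default delay, and the idea that you already have a list of (url, date) pairs, for example from the site's RSS feed or index page.

```python
import hashlib
import time
import urllib.request
from datetime import date
from pathlib import Path


def fetch(url, timeout=10):
    """Download a page and return it as text (UTF-8 is assumed)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def article_path(root, site, published, url):
    """Build <root>/<site>/<YYYY-MM-DD>/<hash of url>.txt."""
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".txt"
    return Path(root) / site / published.isoformat() / name


def crawl(root, site, items, to_text, download=fetch, delay=2.0):
    """Save each new article as a text file.

    items is a list of (url, datetime.date) pairs; to_text turns HTML
    into plain text. Returns the list of files written in this run.
    """
    written = []
    for url, published in items:
        target = article_path(root, site, published, url)
        if target.exists():
            continue  # already downloaded: skip without making a request
        html = download(url)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(to_text(html), encoding="utf-8")
        written.append(target)
        time.sleep(delay)  # pause between requests to avoid a block
    return written
```

A cron job could then call something like crawl("news", "example.com", items, to_text) for each site, where to_text is whatever tag-stripping function you choose. Because the check for an existing file happens before the request, running the job again only downloads news it has not saved yet.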