Python
iamXado, 2020-09-17 18:36:41

How to make an automatic parser?

I have this code:

import telebot
import config
from time import sleep
from bs4 import BeautifulSoup
import requests

bot = telebot.TeleBot(config.token)

@bot.message_handler(commands = ['start'])
def start(message):

    html = requests.get("https://www.rbc.ru/short_news")
    soup = BeautifulSoup(html.text, 'lxml')
    title = soup.find('span', class_ = 'item__title-wrap')
    href = soup.find('div', class_ = 'item__wrap l-col-center')

    while html.status_code == 200:

        for t in title.find_all('span', class_ = 'item__title rm-cm-item-text')[:1]:

            answer_title = t.text.strip()
            print(answer_title)

        for h in href.find_all('a', class_ = 'item__link')[:1]:

            answer_href = h.get('href')
            print(answer_href)

            bot.send_message(message.chat.id, f'{answer_title}\n\n{answer_href}')

            sleep(5)

if __name__ == '__main__':
    bot.polling(none_stop = True)


It parses the RBC news feed (title + link), or more precisely, only the latest news item.

I have two questions.

1. How can I parse not the latest news item but an arbitrary one (for example, the penultimate)?
2. How can I check for new news, so that the program notices a newly published item and parses it immediately?

PS I also noticed that the same news item is parsed on every timer tick. That is, while the program is running it parses an item, and after each interval it parses that same item again, even if new news has appeared on the site, until I restart the program.

1 answer(s)
soremix, 2020-09-17
@iamXado

1. How can I parse not the latest news item but an arbitrary one (for example, the penultimate)?

Logically, you need to find all the news items and then pick the penultimate one from that list.
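Picking an item by position could be sketched as follows. This is a minimal illustration, not a tested scraper for the live site: `SAMPLE_HTML` is hypothetical markup that only reuses the class names from the question, `parse_news` is a helper name introduced here, and it assumes each `item__title` span sits inside its `item__link` anchor. The built-in `html.parser` is used instead of `lxml` so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class names from the question;
# the structure of the real page may differ.
SAMPLE_HTML = """
<div class="item__wrap l-col-center">
  <a class="item__link" href="https://example.com/news/3">
    <span class="item__title rm-cm-item-text">Newest headline</span>
  </a>
</div>
<div class="item__wrap l-col-center">
  <a class="item__link" href="https://example.com/news/2">
    <span class="item__title rm-cm-item-text">Previous headline</span>
  </a>
</div>
<div class="item__wrap l-col-center">
  <a class="item__link" href="https://example.com/news/1">
    <span class="item__title rm-cm-item-text">Oldest headline</span>
  </a>
</div>
"""


def parse_news(html):
    """Return (title, href) pairs in page order (assumed newest first)."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for link in soup.find_all("a", class_="item__link"):
        title = link.find("span", class_="item__title")
        if title is not None:
            items.append((title.get_text(strip=True), link.get("href")))
    return items


news = parse_news(SAMPLE_HTML)
latest = news[0]       # what the original [:1] slice effectively selects
penultimate = news[1]  # the item published just before the latest one
print(penultimate)
```

Once all items are in one list, any of them is available by index, instead of being cut down to the first element with `[:1]`.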
2. How can I check for new news, so that the program notices a newly published item and parses it immediately?

How do you yourself know that a news item is new? Most likely you remember the headline of the latest item, and when you refresh the page you find the latest article again and compare its headline with the one you remember. It may come as a surprise, but it works exactly the same way for the bot: find the current latest article -> save its title in a variable -> after some interval X, find the latest item again and compare the titles.
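The "remember and compare" step might look like this. `LatestNewsTracker` is a hypothetical helper introduced only for illustration; it treats the very first title it sees as new, which may or may not be what you want for a bot that has just started.

```python
class LatestNewsTracker:
    """Remembers the last headline seen and reports when it changes."""

    def __init__(self):
        self.last_title = None  # nothing has been seen yet

    def is_new(self, title):
        """Return True if `title` differs from the remembered one, then remember it."""
        if title == self.last_title:
            return False
        self.last_title = title
        return True


tracker = LatestNewsTracker()
# Simulated sequence of "latest headline" values from successive checks.
results = [tracker.is_new(t) for t in ["A", "A", "B", "B", "C"]]
print(results)  # [True, False, True, False, True]
```

In the bot, `bot.send_message` would be called only when `is_new` returns True. Comparing links instead of titles is an equally valid choice if headlines can be edited after publication.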
PS I also noticed that the same news item is parsed on every timer tick. That is, while the program is running it parses an item, and after each interval it parses that same item again, even if new news has appeared on the site, until I restart the program.

That's right: you fetched the page's HTML once, before the loop, and never fetched it again:
html = requests.get("https://www.rbc.ru/short_news")
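One way to fix this is to move the download inside the loop, so that every pass works with a fresh copy of the page. The sketch below is only an illustration: `poll` is a hypothetical helper, and the `fetch` callable is injected so that the example runs without network access.

```python
import time


def poll(fetch, interval=5.0, max_polls=None, sleep=time.sleep):
    """Yield a fresh fetch() result on every pass, waiting `interval` seconds between passes.

    `max_polls=None` means "poll forever".
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if polls:
            sleep(interval)  # wait between passes, not before the first one
        yield fetch()        # the page is downloaded again on every iteration
        polls += 1


# Simulated page contents; a real fetch would download the page instead.
pages = iter(["page v1", "page v1", "page v2"])
snapshots = list(poll(lambda: next(pages), interval=0, max_polls=3))
print(snapshots)  # ['page v1', 'page v1', 'page v2']
```

In the bot, `fetch` could be something like `lambda: requests.get("https://www.rbc.ru/short_news", timeout=10).text`, with each yielded page parsed and compared against the previously seen headline.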
