How can I neatly count the words on a website?


Python

Anton, 2018-07-19 13:37:27

I set out to count how many times a certain word appears on a web page.
I threw together this code:

import requests
from bs4 import BeautifulSoup
import re


word = 'Pitton'
url = 'https://en.wikipedia.org/wiki/Joseph_Pitton_de_Tournefort'
count = 0

r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
# strip the HTML tags
w = re.sub(r'<[^>]+>', '', str(soup))
# replace non-word characters with spaces so that split() works correctly
w = re.sub(r'\W', ' ', w)

for i in w.split():
    if i.lower() == word.lower():
        count += 1

print(count)

But this code does not deal with the text left over from script elements: the tags are stripped, but their contents remain and get counted. Of course, I could write one more regular expression, but maybe there is a library that removes everything unnecessary from the HTML and turns it into a clean string?


1 answer(s)

Alexey Makarenya, 2018-07-19
@makarenya

There is html2text, which converts HTML to text. People also do this:

for script in soup(["script", "style"]):
    script.extract()
text = soup.get_text()
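
Putting that together with the counting loop from the question, a minimal sketch might look like the following. The names count_word and sample_html are made up for illustration, and the built-in "html.parser" is used here instead of lxml only so the example needs no extra parser installed. It runs on an inline HTML string rather than a live page, so the result is easy to check by hand.

```python
import re

from bs4 import BeautifulSoup


def count_word(html, word):
    """Count whole-word, case-insensitive occurrences of `word` in the visible text of `html`."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop <script> and <style> elements together with their contents.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # A space separator keeps words from adjacent elements from merging.
    text = soup.get_text(separator=" ")
    # \w+ picks out runs of letters, digits, and underscores as words.
    words = re.findall(r"\w+", text.lower())
    return words.count(word.lower())


# Hypothetical sample page: the word also appears inside script and style.
sample_html = """
<html><head><script>var name = "Pitton";</script>
<style>.pitton { color: red; }</style></head>
<body><h1>Joseph Pitton de Tournefort</h1>
<p>Pitton, a botanist.</p></body></html>
"""

print(count_word(sample_html, "Pitton"))  # prints 2
```

For a real page, the html argument would be r.content from the requests.get(url) call in the question. Matching whole words with re.findall replaces both regular expressions from the original code, since get_text() already leaves the tags out.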