Python
Denis9999, 2015-12-21 07:09:25

How to find the number of occurrences of a word in the text of a web page using Python 3?

The task is described in the title. I have been searching the Internet for an hour without making any progress. Which functions should I use? Point me in the right direction, so to speak.


2 answers
Pavel Karateev, 2015-12-21
@Lancelote

What exactly is the problem? At the very least, you can do this:
text.count(word)
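
Keep in mind that str.count counts substring matches, so searching for "cat" also picks up "catalog". Here is a minimal sketch of the difference, assuming that whole-word, case-insensitive matching is what the question is after. The helper name count_whole_word and the sample string are made up for illustration:

```python
import re


def count_whole_word(text, word, ignore_case=True):
    """Count occurrences of word as a standalone word (hypothetical helper)."""
    flags = re.IGNORECASE if ignore_case else 0
    # \b is a word boundary; re.escape guards against regex metacharacters in word.
    pattern = r'\b' + re.escape(word) + r'\b'
    return len(re.findall(pattern, text, flags))


sample = 'cat catalog concatenate Cat'

# str.count matches substrings and is case-sensitive:
# it finds "cat" inside "catalog" and "concatenate" but misses "Cat".
print(sample.count('cat'))

# The regex version only counts the standalone words "cat" and "Cat".
print(count_whole_word(sample, 'cat'))
```

If plain substring counting is good enough for the task, text.count(word) stays the simplest option.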

nirvimel, 2015-12-21
@nirvimel

from collections import Counter
import re

from lxml.html import fromstring
# Note: with lxml 5.2 and later this import needs the separate lxml_html_clean package installed
from lxml.html.clean import Cleaner
import requests


def extract_text(node):
    """
    Extract text without markup from node
    """

    def extract_text_gen(node):
        if node.text:
            yield node.text.strip()
        for child in node.iterchildren():
            yield from extract_text_gen(child)
            if child.tail:
                yield child.tail.strip()

    return ' '.join((s for s in extract_text_gen(node) if s))


def count_words(text):
    return Counter((s for s in re.split(r'\s', text) if s))


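# Download the page and decode it as UTF-8
html = requests.get('https://toster.ru/q/276749').content.decode('utf-8')
root = fromstring(html)
# Strip scripts, comments and other unsafe markup in place
Cleaner()(root)
# Extract the visible text from <body> and count each word
text = extract_text(root.body)
words_count = count_words(text)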

print('\n'.join(('"%s": %i' % (word, count) for word, count in words_count.most_common())))

The function extract_text is taken from one of my projects, slightly adapted and simplified.
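
To get the count for one particular word, as the question asks, the Counter can simply be indexed by that word. Since count_words splits on whitespace only, "word," and "Word" would be counted separately from "word". The sketch below is one possible way around that, assuming case-insensitive matching without punctuation is wanted. The name count_words_normalized and the sample text are hypothetical and stand in for the text extracted from the page:

```python
from collections import Counter
import re


def count_words_normalized(text):
    """Count words case-insensitively, ignoring surrounding punctuation."""
    # \w+ matches runs of letters, digits and underscores (Unicode-aware in Python 3).
    return Counter(re.findall(r'\w+', text.lower()))


# Stand-in for the text returned by extract_text(root.body).
sample_text = 'Python, python and PYTHON: three spellings of one word.'
words_count = count_words_normalized(sample_text)

# Indexing a Counter with a missing key returns 0 instead of raising KeyError.
print(words_count['python'])
print(words_count['java'])
```

The lookup key has to be lowercased as well, because the text is lowercased before counting.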
