Programming
Valentine, 2015-10-06 12:09:06

Algorithm for parsing pages by a list of keywords?

Good afternoon. My question is about parsing the page itself, not fetching it (with curl or any other tool).
I have a page (an HTML document) and a list of keywords, and I need to count the occurrences of each keyword on the page. The only idea I have come up with is to generate a regular expression (something like (word1|word2|word3)) and then count the occurrences by simply iterating over the matches.
Are there more elegant solutions? I plan to implement this in PHP or Node.js.

3 answer(s)
⚡ Kotobotov ⚡, 2015-10-06
@angrySCV

Brute force, brute force again. Sorry for the bluntness, that's just how I was raised.
Where do PHP developers get such a love for implementing everything through brute-force enumeration?
Once your keyword list grows past a few thousand entries, you will wait hours for the results of that enumeration, especially in PHP.
A more elegant solution is to use suffix trees.
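
The answer gives no code, so here is a rough sketch of the underlying idea: index the page text once, then look up each keyword quickly. It uses a suffix array, which is a simpler relative of the suffix tree and not a suffix tree itself. The function names are hypothetical, the construction is deliberately naive (fine for a demo, too slow for large pages), and it counts substring occurrences, so "word" is also found inside "keyword".

```python
def build_suffix_array(text):
    """Return the start indices of all suffixes of text in sorted order.

    Naive construction for illustration only; real implementations
    use much faster algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, suffix_array, pattern):
    """Count the suffixes that start with pattern, i.e. its occurrences."""
    m = len(pattern)

    def first_index(strictly_greater):
        # Binary search over the sorted suffixes, comparing only the
        # first m characters of each suffix with the pattern.
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            start = suffix_array[mid]
            prefix = text[start:start + m]
            if prefix < pattern or (strictly_greater and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    # All suffixes starting with pattern form one contiguous block.
    return first_index(True) - first_index(False)


page = "word1 word2 keyword document"
index = build_suffix_array(page)  # built once per page
counts = {k: count_occurrences(page, index, k) for k in ["word", "document"]}
print(counts)  # {'word': 3, 'document': 1}
```

Each lookup costs a binary search over the index instead of a scan of the whole page, which is where the benefit for thousands of keywords would come from.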

Dmitry Kim, 2015-10-06
@kimono

sandbox.onlinephpfunctions.com/code/37932fd36ced8e...

$text = 'Добрый день.Вопрос про сам процесс парсинга страницы, а не ее получение (с помощью curl или любого другого инструмента).
Есть некая страница (HTML-документ) и список ключевых слов. Необходимо получить количество вхождений каждого слова на странице. Мне пришло в голову только генерировать регулярку (что-то вроде (слово1|слово2|слово3)), а потом считать простым перебором количество вхождений. 
Какие есть более изящные решения? Реализовывать предполагаю на PHP или nodejs.';

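// $text above holds the original Russian wording of the question.
// The pattern searches for "слово" (word) and "документ" (document);
// the "u" modifier enables UTF-8 mode and "i" makes matching case-insensitive.
preg_match_all('/слово|документ/ui', $text, $matches, PREG_PATTERN_ORDER);

// $matches[0] holds every match in order of appearance.
print_r($matches);

Output:
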
Array
(
    [0] => Array
        (
            [0] => документ
            [1] => слово
            [2] => слово
            [3] => слово
        )
)
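
The match list above still has to be tallied per keyword to answer the question. A minimal Python sketch of that extra step follows; the function name `count_keywords` is hypothetical, and like the PHP snippet it matches substrings without word boundaries. One caveat worth stating: an alternation finds non-overlapping matches, so if one keyword contains another, only the longer one is counted at that position.

```python
import re


def count_keywords(text, keywords):
    """Return {keyword: occurrences} using a single alternation pattern."""
    counts = {k.lower(): 0 for k in keywords}
    # Try longer keywords first so a short one cannot shadow a longer one.
    alternatives = sorted(counts, key=len, reverse=True)
    pattern = re.compile(
        "|".join(re.escape(k) for k in alternatives),
        re.IGNORECASE,
    )
    for match in pattern.finditer(text):
        key = match.group(0).lower()
        if key in counts:  # guard against unusual case-folding edge cases
            counts[key] += 1
    return counts


sample = "HTML-документ: (слово1|слово2|слово3)"
print(count_keywords(sample, ["слово", "документ"]))
# {'слово': 3, 'документ': 1}
```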

abcd0x00, 2015-10-07
@abcd0x00

You need a "loaded set", that is, a dictionary (associative array) that stores a counter for each key.
Then you just get the sequence of all the words on the page, go through them, compare each one against the keyword list, and increment the counter for the matching key.
Here is an illustration in Python:

>>> words = ['a', 'b', 'c', 'a', 'b', 'b']
>>> 
>>> d = {}
>>> for i in words:
...     if i in d:
...         d[i] += 1
...     else:
...         d[i] = 1
... 
>>> print(d)
{'b': 3, 'c': 1, 'a': 2}
>>>
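
The illustration above counts every word in a ready-made list. A slightly fuller sketch of the whole procedure, applied to an HTML page and restricted to the keywords, might look like this. It uses only the standard library; the names `TextExtractor` and `count_keywords_in_html` are hypothetical, and the sketch assumes that words are runs of `\w` characters and that text inside `<script>` and `<style>` should be ignored.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def count_keywords_in_html(html, keywords):
    """Return {keyword: occurrences} for whole words in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Split the visible text into lower-cased words.
    words = re.findall(r"\w+", " ".join(parser.chunks).lower())
    # The dictionary holds one counter per keyword.
    counts = {k.lower(): 0 for k in keywords}
    for word in words:
        if word in counts:
            counts[word] += 1
    return counts
```

Because the lookup `word in counts` is a hash lookup, the cost of the loop depends on the number of words on the page and not on the number of keywords. Unlike the regular-expression approach, this counts whole words only, so "keywords" does not count as "word".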
