How to find an area with useful content on a website page?
The task is to search for certain words on the pages of a site. The search will be done by a browser extension.
Problem: how do I make sure I am searching for a word not in the entire document, but only in its useful part?
For example, I could immediately exclude some tags from the search, such as aside, nav ... maybe even header and footer, although those can also contain useful information: for example, the article title may sit in the header. Search the whole body? Then I will find text in advertising blocks, which is not good. Has anyone already solved this problem?
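A minimal sketch of the tag-exclusion idea from the question, written in Python with the standard html.parser module only to keep it self-contained (in an extension the same check would run over the live DOM). The names SKIPPED_TAGS, VisibleTextCollector and page_contains are hypothetical, and the choice to skip aside, nav, footer, script and style while keeping header is an assumption, not something the thread settles.

```python
from html.parser import HTMLParser

# Assumed skip list: header is kept because it may hold the article title.
SKIPPED_TAGS = {"aside", "nav", "footer", "script", "style"}


class VisibleTextCollector(HTMLParser):
    """Collects text that is not nested inside any tag from SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # number of skipped tags currently open
        self.chunks = []     # text fragments found outside skipped tags

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def page_contains(html, word):
    """Case-insensitive substring search in the text outside skipped tags."""
    collector = VisibleTextCollector()
    collector.feed(html)
    collector.close()
    text = " ".join(collector.chunks).lower()
    return word.lower() in text
```

This only filters by tag name, so an advertising block marked up as a plain div would still be searched; that is the gap the answers below try to close.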
This question has been discussed on Toster repeatedly, and I have written answers to it myself. Search the site.
The task itself is difficult in two ways:
1. Not everyone writes markup according to the standards.
2. You need to look specifically at the structure of the resource being parsed.
If we are talking about any resource in general and any data matching some pattern, then there are a lot of problems. Everyone writes markup however it suits them.
I wrote a parser a year ago to pull out emails and phone numbers, and the best result was 56%. That is, out of 100 pages I got contacts from 56. And that was for formats known in advance, for which you can write a regular expression ...
Well, in short: this is the task of finding the MAIN content of the page.
1. Remove all containers (except for text markup tags) that have more than 1 child element.
2. Strip all tags from the body container except container tags (div, td).
3. Find the container (div, td) with the longest text.
4. Feel free to scrape it.
It was:
<div1>
  <div2>
    <a href="/1/">link1</a>
    <a href="/2/">link2</a>
  </div2>
  <div3>
    <span content>
      some text
      <p>
        <i>more text</i>
      </p>
    </span content>
  </div3>
</div1>
It became:
<div3>
  some text
  more text
</div3>
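The steps above could be sketched roughly like this. It is one loose reading of the recipe, not the answerer's actual code: a container is dropped as a candidate when it directly holds more than one child that is not a text markup tag, text is credited to the nearest enclosing div or td, and the container with the longest text wins. The TEXT_TAGS list and the names Container, MainContentFinder and find_main_text are assumptions, and Python's standard html.parser stands in for the browser DOM.

```python
from html.parser import HTMLParser

CONTAINER_TAGS = {"div", "td"}  # containers named in the answer
TEXT_TAGS = {"p", "span", "i", "b", "em", "strong", "br"}  # assumed text markup
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}   # no closing tag


class Container:
    def __init__(self, tag):
        self.tag = tag
        self.other_children = 0  # direct children that are not text markup
        self.chunks = []         # text credited to this container

    @property
    def text(self):
        return " ".join(self.chunks)


class MainContentFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of (tag, Container or None)
        self.containers = []  # every div/td seen, in document order

    def handle_starttag(self, tag, attrs):
        # Step 1 bookkeeping: count non-text direct children of a container.
        if self.open_tags and tag not in TEXT_TAGS:
            parent = self.open_tags[-1][1]
            if parent is not None:
                parent.other_children += 1
        if tag in VOID_TAGS:
            return
        container = None
        if tag in CONTAINER_TAGS:
            container = Container(tag)
            self.containers.append(container)
        self.open_tags.append((tag, container))

    def handle_endtag(self, tag):
        # Ignore stray closing tags; otherwise pop up to the matching tag.
        if not any(open_tag == tag for open_tag, _ in self.open_tags):
            return
        while self.open_tags:
            open_tag, _ = self.open_tags.pop()
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.open_tags and self.open_tags[-1][0] in {"script", "style"}:
            return
        # Step 2: inline tags vanish, their text goes to the nearest container.
        text = " ".join(data.split())
        if not text:
            return
        for _, container in reversed(self.open_tags):
            if container is not None:
                container.chunks.append(text)
                break


def find_main_text(html):
    finder = MainContentFinder()
    finder.feed(html)
    finder.close()
    # Step 1 (loose reading): drop containers with several non-text children.
    candidates = [c for c in finder.containers if c.other_children <= 1]
    # Step 3: the container with the longest text is treated as main content.
    best = max(candidates, key=lambda c: len(c.text), default=None)
    return best.text if best else ""
```

On markup shaped like the "It was" example (with real div tags), the link block and the outer wrapper are dropped and the text of the inner content div is returned. Real pages will need the tag lists tuned per site, which matches the caveats raised earlier in the thread.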