How to link unrelated tags via bs4?

Dmitry, 2022-02-14 16:01:10

The article has sections, but the markup is a flat run of tags: the sections are not marked with classes, not wrapped in containers, and not separated in any other way. I am trying to parse it with bs4 into semantic blocks. That is, going by checks on the tags, store the first h2 in a variable (say, section 1) and the paragraphs after it in another variable (content of section 1); when the second h2 comes up, do the same, but write the data to section 2. Can this kind of logic be implemented at the bs4 level?
Example:

<h2>Section heading</h2>
<p>Some content</p>
<p>Some content</p>
<p>Some content</p>
<p>Some content</p>
<p>Some content</p>

<h2>Another section heading</h2>
<p>Some more content</p>
<p>Some more content</p>
<p>Some more content</p>
<p>Some more content</p>
<p>Some more content</p>

3 answer(s)

Neron, 2016-02-24
@dbmb

Maybe some Open Sans black. Doesn't look like Raleway at all.

Vindicar, 2022-02-14
@Vindicar

If all of these tags sit inside one parent (or you can otherwise collect them into a single list in the correct order), then it is trivial.

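# tags: the tags to process, already in document order
# (for example, the direct children of the common parent)
sections = []
current_section = None
paragraphs = []
for tag in tags:  # iterate over the tags to be processed
    if tag.name == 'h2':
        # a new heading closes the previous section, if there was one
        if current_section is not None:
            sections.append((current_section, paragraphs))
        current_section = tag
        paragraphs = []
    elif tag.name == 'p':
        paragraphs.append(tag)
# do not forget the last section, which no later h2 closes
if current_section is not None:
    sections.append((current_section, paragraphs))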

sections will then contain a list of tuples of the form (heading, list of paragraphs), in the order in which they appear in the text. Your job is to make sure that tags holds the right tags in the right order.
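
For completeness, here is one way the tags list might be built and the loop above run end to end. This is only a sketch: the wrapping div with id="article", the helper name split_sections and the sample markup are assumptions made for the illustration, and html.parser is used just to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: a flat run of h2/p tags inside one parent.
HTML = """
<div id="article">
<h2>Section heading</h2>
<p>Some content 1</p>
<p>Some content 2</p>
<h2>Another section heading</h2>
<p>Some more content 1</p>
</div>
"""


def split_sections(tags):
    """Group an ordered run of tags into (h2, [p, ...]) tuples."""
    sections = []
    current_section = None
    paragraphs = []
    for tag in tags:
        if tag.name == 'h2':
            # a new heading closes the previous section, if any
            if current_section is not None:
                sections.append((current_section, paragraphs))
            current_section = tag
            paragraphs = []
        elif tag.name == 'p':
            paragraphs.append(tag)
    # the last section is not closed by a following h2
    if current_section is not None:
        sections.append((current_section, paragraphs))
    return sections


soup = BeautifulSoup(HTML, 'html.parser')
container = soup.find('div', id='article')
# recursive=False keeps only direct children, in document order
tags = container.find_all(['h2', 'p'], recursive=False)

sections = split_sections(tags)
# Reduce the tags to plain text for inspection
result = [(h.get_text(), [p.get_text() for p in ps]) for h, ps in sections]
```

Paragraphs that appear before the first h2 are dropped by this loop, which matches the behaviour of the code above.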

AVKor, 2022-02-19
@AVKor

from bs4 import BeautifulSoup

OLD_DOC = '''
<html>
<body>
<h2>Section heading</h2>
<p>Some content 1</p>
<p>Some content 2</p>
<p>Some content 3</p>
<p>Some content 4</p>
<p>Some content 5</p>

<h2>Another section heading</h2>
<p>Some more content 1</p>
<p>Some more content 2</p>
<p>Some more content 3</p>
<p>Some more content 4</p>
<p>Some more content 5</p>

<h2>Yet another section heading</h2>
<p>Yet more content 1</p>
<p>Yet more content 2</p>
<p>Yet more content 3</p>
<p>Yet more content 4</p>
<p>Yet more content 5</p>
</body>
</html>
'''

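NEW_DOC = ''
# Work backwards: find the last <h2> that is still in the old document.
part_start = OLD_DOC.rfind("<h2>")
while part_start != -1:
    # Everything from that <h2> up to </body> is one section.
    part_stop = OLD_DOC.find('</body>')
    part = OLD_DOC[part_start:part_stop].strip()
    # Wrap the section in a <div> and put it in front of the ones already collected.
    NEW_DOC = f'<div>\n{part}\n</div>\n{NEW_DOC}'
    # Cut the section out of the old document, then look for the previous <h2>.
    OLD_DOC = OLD_DOC.replace(part, "")
    part_start = OLD_DOC.rfind("<h2>")

soup = BeautifulSoup(NEW_DOC, 'lxml')
data = soup.find_all('div')
# from here on, process as you like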
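
The same grouping can probably also be done without slicing the HTML string, by walking the siblings that follow each h2 until the next h2 appears. The sketch below is one possible variant, not a drop-in replacement for the code above: the sample markup and the dictionary layout (heading text mapped to a list of paragraph texts) are assumptions for the illustration, and it only works when the h2 and p tags are siblings.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup, shortened from the example above.
HTML = """
<html>
<body>
<h2>Section heading</h2>
<p>Some content 1</p>
<p>Some content 2</p>
<h2>Another section heading</h2>
<p>Some more content 1</p>
</body>
</html>
"""

soup = BeautifulSoup(HTML, 'html.parser')

sections = {}
for heading in soup.find_all('h2'):
    content = []
    # next_siblings yields both tags and the whitespace strings between them
    for sibling in heading.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip text nodes such as "\n"
        if sibling.name == 'h2':
            break  # the next section starts here
        if sibling.name == 'p':
            content.append(sibling.get_text())
    sections[heading.get_text()] = content
```

Using the heading text as a dictionary key assumes the headings are unique; if they may repeat, a list of (heading, paragraphs) tuples is the safer structure.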