How to remove duplicate tags from an HTML string?
Hello.
I'm parsing the content of a page with Python 3 and Beautiful Soup. The problem is that the page may contain duplicate HTML tags located in different branches of the DOM (I don't know their positions in advance). The duplicates must be removed (keeping the original tag) so that I can later extract the text inside the tags. For example, the soup object could contain this:
<div class="act_first">
  <div class="stats">
    <div class="key">COLOR</div>
    <div class="value">Brown</div>
  </div>
</div>
<div id="stats" class="not_visible">
  <div class="stats">
    <div class="key">COLOR</div>
    <div class="value">Brown</div>
  </div>
</div>
Here the repeated block is:
<div class="stats">
  <div class="key">COLOR</div>
  <div class="value">Brown</div>
</div>
and this is what should remain after cleaning:
<div class="act_first">
  <div class="stats">
    <div class="key">COLOR</div>
    <div class="value">Brown</div>
  </div>
</div>
<div id="stats" class="not_visible">
</div>
from bs4 import BeautifulSoup

# Get a list of all tags on the page
all_tags = soup.find_all()
# List of tags after cleaning
all_tags_list = []
# Convert the list of tags into a list of strings
for tag in all_tags:
    all_tags_list.append(str(tag))
print('Total tags on the page', len(all_tags_list))
tags_before_clear = all_tags_list
# Remove duplicate tags by comparing each one with every other
for tag in all_tags_list:
    for tag_1 in all_tags_list:
        # If the tags match, remove the second tag
        if tag == tag_1:
            tags_before_clear.remove(tag)
print('Tags after cleaning', len(tags_before_clear))
# Join the list into a string
tags_before_clear_str = ' '.join(tags_before_clear)
# Create the soup object again
soup = BeautifulSoup(tags_before_clear_str, 'html5lib')
# Extract only the text from the tags and convert it to lowercase
soup_text = soup.text.lower()
Use CSS selectors or XPath. From your example:
As a result:
<div class="act_first">
<div class="stats">
<div class="key">COLOR</div>
<div class="value">Brown</div>
</div>
</div>
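As a rough sketch of how this could look in Beautiful Soup (not necessarily the exact code meant above): `select_first_block` picks one branch with the CSS selector `div.act_first`, and `drop_duplicate_tags` covers the case where the position of the copies is not known in advance. Both function names are made up for illustration, and the built-in `html.parser` is assumed instead of `html5lib` so that nothing extra has to be installed. Note that the loop in the question cannot work as written: `tags_before_clear = all_tags_list` is the same list object rather than a copy, every string is compared with itself, and the list is modified while it is being iterated.

```python
from bs4 import BeautifulSoup

# Sample markup from the question: the same "stats" block in two branches.
HTML = """
<div class="act_first">
  <div class="stats">
    <div class="key">COLOR</div>
    <div class="value">Brown</div>
  </div>
</div>
<div id="stats" class="not_visible">
  <div class="stats">
    <div class="key">COLOR</div>
    <div class="value">Brown</div>
  </div>
</div>
"""


def select_first_block(soup):
    """Selector approach: take only the branch matched by a CSS selector."""
    return soup.select_one("div.act_first")


def drop_duplicate_tags(soup):
    """Generic approach: keep the first copy of each tag, detach later copies.

    Tags are compared by their serialized markup with whitespace collapsed,
    which is an assumption about what counts as a "duplicate".
    """
    seen = set()
    # find_all() returns the tags in document order, so the first copy wins.
    for tag in soup.find_all():
        key = " ".join(str(tag).split())
        if key in seen:
            # extract() detaches the tag from the tree but leaves it intact,
            # so its children can still be visited safely later in the loop.
            tag.extract()
        else:
            seen.add(key)
    return soup


if __name__ == "__main__":
    block = select_first_block(BeautifulSoup(HTML, "html.parser"))
    print(block.get_text(" ", strip=True))  # COLOR Brown

    cleaned = drop_duplicate_tags(BeautifulSoup(HTML, "html.parser"))
    print(cleaned.get_text(" ", strip=True).lower())  # color brown
```

The second function removes every repeated tag, including small ones: if two different blocks both contain `<div class="key">COLOR</div>` but with different values, the second `key` would be dropped as well. If that matters, restrict the search, for example `soup.find_all("div", class_="stats")`.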