How to remove duplicate tags from HTML string?

Evgeny_A, 2019-09-25 11:09:02

Python

Hello.
I'm parsing the content of a page using Python 3 and Beautiful Soup. The problem is that the page may contain duplicate HTML tags located in different branches of the DOM (I do not know their positions in advance). The duplicates must be removed (leaving the original tag in place) so that I can later extract the text inside the tags. For example, the soup object could contain this:

<div class="act_first">
  <div class="stats">

    <div class="key">COLOR</div>
    <div class="value">Brown</div>

  </div>
</div>

<div id="stats" class="not_visible">
  <div class="stats">
    
    <div class="key">COLOR</div>
    <div class="value">Brown</div>
    
  </div>
</div>

This is what needs to be removed:

<div class="stats">

    <div class="key">COLOR</div>
    <div class="value">Brown</div>

  </div>

So that this remains:

<div class="act_first">
  <div class="stats">

    <div class="key">COLOR</div>
    <div class="value">Brown</div>

  </div>
</div>

<div id="stats" class="not_visible">

</div>

Here is my code, with detailed comments:

      # Get a list of all tags on the page
      all_tags = soup.find_all()

      # List of tags after cleaning
      all_tags_list = []

      # Convert the list of tags into a list of strings
      for tag in all_tags:

        all_tags_list.append(str(tag))

      print('Total tags on the page:', len(all_tags_list))

      tags_before_clear = all_tags_list

      # Remove duplicate tags by comparing every tag with every other tag
      for tag in all_tags_list:

        for tag_1 in all_tags_list:

          # If the tags match, remove the second tag
          if tag == tag_1:

            tags_before_clear.remove(tag)

      print('Tags after cleaning:', len(tags_before_clear))

      # Join the list into a string
      tags_before_clear_str = ' '.join(tags_before_clear)

      # Create the soup object again
      soup = BeautifulSoup(tags_before_clear_str, 'html5lib')

      # Extract only the text from the tags and convert it to lowercase
      soup_text = soup.text.lower()

I don't understand how the loop should be written so that I end up with a list of unique tags in their original order.

1 answer(s)

Andrey_Dolg, 2019-09-27

Use CSS selectors or XPath. With your example, the result is:

<div class="act_first">
  <div class="stats">

    <div class="key">COLOR</div>
    <div class="value">Brown</div>

  </div>
</div>
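
A minimal sketch of the CSS-selector approach, assuming the markup from the question. The selector `div.act_first` and the `html.parser` backend are choices made for this illustration, not necessarily the ones used in the answer:

```python
from bs4 import BeautifulSoup

# Markup from the question: the same "stats" block appears twice.
html = """
<div class="act_first">
  <div class="stats">
    <div class="key">COLOR</div>
    <div class="value">Brown</div>
  </div>
</div>
<div id="stats" class="not_visible">
  <div class="stats">
    <div class="key">COLOR</div>
    <div class="value">Brown</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select only the branch we want instead of deleting the other one.
visible = soup.select_one("div.act_first")

# Join the text nodes with a space, strip whitespace, lowercase.
text = visible.get_text(" ", strip=True).lower()
print(text)
```

With this approach the hidden copy never has to be removed, because the text is read from the selected branch only.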

BeautifulSoup exists so that you don't have to treat the page as plain text.
I doubt it is impossible to select the one element you need by its class.
Also, if you want to avoid duplicates, use sets.
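
If the duplicates really do have to be removed from the tree, one possible sketch combines both hints: walk the parsed tags in document order and keep a set of the markup already seen. The helper name `remove_duplicate_tags`, the whitespace normalization, and the `html.parser` backend are assumptions made for this illustration:

```python
from bs4 import BeautifulSoup

def remove_duplicate_tags(soup):
    """Detach every tag whose whitespace-normalized markup was seen before."""
    seen = set()
    # find_all() returns a list in document order, so the first copy wins
    # and it is safe to modify the tree while looping over the result.
    for tag in soup.find_all():
        # Collapse whitespace so indentation differences do not matter.
        key = " ".join(str(tag).split())
        if key in seen:
            tag.extract()  # remove the later copy from the tree
        else:
            seen.add(key)
    return soup

# Markup from the question; the second copy has different blank lines.
html = """
<div class="act_first">
  <div class="stats">

    <div class="key">COLOR</div>
    <div class="value">Brown</div>

  </div>
</div>
<div id="stats" class="not_visible">
  <div class="stats">
    <div class="key">COLOR</div>
    <div class="value">Brown</div>
  </div>
</div>
"""

soup = remove_duplicate_tags(BeautifulSoup(html, "html.parser"))
text = soup.get_text(" ", strip=True).lower()
print(text)
```

Note that this removes every repeated tag, including legitimate repeats such as identical list items, so it may be worth passing a narrower filter to `find_all()` when only certain blocks can be duplicated.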