chrispsow, 2019-06-05 14:00:41
Python

How do I extract the text from a tag and then replace it using BeautifulSoup in Python?

There is this piece of HTML code:

<div class="f-subheader subheader f-subheader-sm" data-editable="true" data-main-class="subheader" data-param="subheader">
            <p>
             holding educativo internacional
            </p>
            <p>
             Academia STANDART LONDRES
            </p>
           </div>
           <div class="f-header header f-header-72" data-editable="true" data-main-class="header" data-param="header">
            <p>
             <br/>
            </p>
            <p>
             <br/>
            </p>
            <p>
             <br/>
            </p>
            <h1>
             Модные курсы и семинары
             <br/>
             парикмахеров, стилистов, визажистов, косметологов И мастеров маникюра
            </h1>
           </div>
           <div class="f-desc description f-desc-xl" data-editable="true" data-main-class="description" data-param="description">
            <p>
             <strong>
              ЕВРОПЕЙСКИЙ СТАНДАРТ ОБУЧЕНИЯ В Мексике и Колумбии
              <br/>
              ОТ ЭКСПЕРТОВ КРАСОТЫ ИЗ ЛОНДОНА
             </strong>
             <br/>
            </p>
           </div>
           <div class="buttons" data-main-class="buttons">
            <button class="btn f-btn btn-success" id="button3504888" style="color: #FFFFFF; background-color: #E31e24; " type="button">
             Ver todos los cursos
            </button>

The text that is no longer in Russian was successfully extracted, translated, and inserted back. The text that is still untranslated was never found and, accordingly, was not processed.
Python script:
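from bs4 import BeautifulSoup as Soup  # assumed import; the script only shows the alias Soup

# html, translator and lang are defined elsewhere in the script
errors = 0  # counter incremented in the except block below
soup = Soup(html, features="html.parser")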
tags = ['span', 'p', 'b', 'a', 'div', 'li', 'h1', 'h2', 'h3', 'button', 'small', 'strong', 'td', 'img', 'input']

for tag in tags:
  for htmltag in soup.find_all(tag):
    try:
      # print(f'Text: {htmltag.text}, string: {htmltag.string}')
      if htmltag.string and len(htmltag.string) > 0:
        # if tag == 'span' and 'Copyright' in htmltag.string : continue
        # print(f'Tag <{tag}> String: {htmltag.string}')
        translated = translator.translate(htmltag.string, dest=lang)
        print(f'<{tag}> {htmltag.string} > {translated.text}')
        htmltag.string.replace_with(translated.text)
      elif tag == 'img' and 'alt' in htmltag.attrs and len(htmltag["alt"]) > 0:
        # print(f'Tag <{tag}> Alt: {htmltag["alt"]}')
        translated = translator.translate(htmltag['alt'], dest=lang)
        print(f'<{tag}> {htmltag["alt"]} > {translated.text}')
        htmltag['alt'] = translated.text
      elif tag == 'input' and 'placeholder' in htmltag.attrs and len(htmltag["placeholder"]):
        # print(f'Tag <{tag}> Placeholder: {htmltag["placeholder"]}')
        translated = translator.translate(htmltag['placeholder'], dest=lang)
        print(f'<{tag}> {htmltag["placeholder"]} > {translated.text}')
        htmltag['placeholder'] = translated.text
    except Exception as e:
      print(f'*** ERROR Tag: {tag} , htmltag: {htmltag} , Str: {htmltag.string} / Err: {e} ***')
      errors += 1

htmltag.text does find the text, but for a <div> block it also returns the code of any <script> tag nested inside it, which htmltag.string does not do.
And .string, as I understand it, returns nothing when the tag's text contains a <br/> or any other nested tag.
How can I extract the text and then replace it in every tag that contains it?
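
For illustration, here is a minimal sketch of one possible approach, not a definitive fix: instead of looping over tag names and reading .string, walk over the individual text nodes with find_all(string=True) and replace each node in place. The names translate_text_nodes, SKIP_PARENTS and fake_translate are hypothetical; fake_translate only stands in for the translator.translate(...).text call from the script above so that the example runs without a translation service.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

SKIP_PARENTS = {"script", "style"}  # text inside these tags is code, not prose


def fake_translate(text):
    """Hypothetical stand-in for translator.translate(text, dest=lang).text."""
    return text.upper()


def translate_text_nodes(html, translate=fake_translate):
    soup = BeautifulSoup(html, "html.parser")
    # find_all(string=True) returns every text node, including the ones that
    # sit next to a <br/>, where the parent's .string would be None.
    for node in soup.find_all(string=True):
        if isinstance(node, Comment) or node.parent.name in SKIP_PARENTS:
            continue
        stripped = node.strip()
        if not stripped:
            continue  # whitespace-only node between tags
        # Translate only the visible text and keep the surrounding whitespace.
        node.replace_with(node.replace(stripped, translate(stripped), 1))
    return soup


sample = "<div><h1>first line<br/>second line</h1><script>var a = 1;</script></div>"
print(BeautifulSoup(sample, "html.parser").h1.string)  # None: <h1> has several children
print(translate_text_nodes(sample))
```

Because each text node is handled separately, the two halves of the <h1> are translated as two separate strings, which may cost the translator some context. The alt and placeholder attributes would still need the separate handling shown in the original script.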
