Python
VaniaPythonToster, 2019-03-24 23:09:27

How to get HTML tags from a string in Python?

The task is to extract only the HTML tags from a piece of text.
For example, given:

<html> 
  <head>
    <title>Test</title>
  </head>
  <body class="body" style="color: red;">
    <p id="1">Test</p>
    <p id="2">Test</p>
  </body>
</html>

the result should be:
<html><head><title></title></head><body><p></p><p></p></body></html>

All attributes inside the tags also need to be removed. I wrote my own solution, but it is slow: one web page takes about 0.5 seconds. Does anyone know of ready-made built-in methods in bs4, selenium, or any other library?


1 answer(s)
longclaps, 2019-03-24
@VaniaPythonToster

I know of ready-made tools built into Python.

import re

s = """
<html> 
  <head>
    <title>Test</title>
  </head>
  <body class="body" style="color: red;">
    <p id="1">Test</p>
    <p id="2">Test</p>
  </body>
</html>"""

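# Match every tag opener ("<p", "</p") or a closing ">" and join the pieces;
# attributes and text between tags are skipped. re.M is not needed here
# because the pattern has no ^ or $ anchors.
print(''.join(re.findall(r'</?[a-z]\w*\b|>', s, flags=re.I)))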
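
One caveat with the regex: it also picks up any stray ">" that appears in the page text or inside an attribute value. As a possible alternative, here is a minimal sketch based on the standard library's html.parser module. The names TagSkeleton and tag_skeleton are made up for this illustration, and it has not been timed against the regex, so treat it as a starting point rather than a drop-in replacement.

```python
from html.parser import HTMLParser


class TagSkeleton(HTMLParser):
    """Hypothetical helper: collects bare tags, dropping attributes and text."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is ignored on purpose, which is what strips the attributes.
        self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    # Text, comments and entities are ignored because their handlers
    # are left at the default no-op implementations.


def tag_skeleton(html):
    """Return only the tags of `html`, without attributes or text."""
    parser = TagSkeleton()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


page = """
<html>
  <head>
    <title>Test</title>
  </head>
  <body class="body" style="color: red;">
    <p id="1">Test</p>
    <p id="2">Test</p>
  </body>
</html>"""

print(tag_skeleton(page))
# <html><head><title></title></head><body><p></p><p></p></body></html>
```

Note that HTMLParser lowercases tag names, and with this sketch a self-closing tag such as <br/> comes out as <br></br>, because the default handler reports it as a start tag followed by an end tag.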

And what is your solution doing?
