How to find HTML tags in a txt file?

Python

Artem, 2018-05-20 23:09:23

Good afternoon! Please help with this problem. I have a txt file that contains both plain text and several tables of the form <table> ... </table>. For example:

"
ACCESSION NUMBER:		0000796343-18-000015
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		109
CONFORMED PERIOD OF REPORT:	20171201
FILED AS OF DATE:		20180122
DATE AS OF CHANGE:		20180122

<table id=1>
    <tr>
        <td>Some Text</td>
    </tr>
</table>

<table id=2>
    <tr>
        <td>Some Text</td>
    </tr>
</table>
"

How can I extract these tables from the text (i.e. the table tags themselves along with their content)? I guess it can be done with a regexp and re.findall, but I can't figure out the correct expression. Please help.

2 answer(s)

DDDsa, 2018-05-22
@Malodar

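To match any tag, use a backreference to the tag name.
findall then returns a tuple per match: the tag name, and everything after it up to (but not including) the closing tag, attributes included: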

>>> import re
>>> # s is assumed to hold the full text of the file
>>> tables = re.findall(r'<(\w+)([\s\S]+?)<\/\1>', s)
>>> tables
[('table', ' id=1>\n    <tr>\n        <td>Some Text</td>\n    </tr>\n'), ('table', ' id=2>\n    <tr>\n        <td>Some Text</td>\n    </tr>\n')]

If you only need the table tag (or you are sure there are no other tags), use this:

>>> tables = re.findall(r'<table([\s\S]+?)<\/table>', s)
>>> tables
[' id=1>\n    <tr>\n        <td>Some Text</td>\n    </tr>\n', ' id=2>\n    <tr>\n        <td>Some Text</td>\n    </tr>\n']
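The patterns above capture only what follows the tag name, so each result lacks the opening "<table" and the closing "</table>". Since the question asks for the tags themselves together with their content, here is a minimal sketch that returns each whole table. The names TABLE_RE and extract_tables are made up for illustration, and the sample string s is a shortened copy of the example from the question. It assumes tables are not nested; a non-greedy regex would cut a nested table off at the first inner </table>.

```python
import re

# Shortened sample mirroring the question: header lines plus two tables.
s = """ACCESSION NUMBER:\t\t0000796343-18-000015
CONFORMED SUBMISSION TYPE:\t10-K

<table id=1>
    <tr>
        <td>Some Text</td>
    </tr>
</table>

<table id=2>
    <tr>
        <td>Some Text</td>
    </tr>
</table>
"""

# The pattern has no capturing group, so findall returns each whole match.
# \b keeps "<table" from matching longer tag names; [^>]* skips attributes.
# re.DOTALL lets "." cross line breaks; ".*?" stops at the first </table>.
TABLE_RE = re.compile(r'<table\b[^>]*>.*?</table>', re.DOTALL | re.IGNORECASE)


def extract_tables(text):
    """Return every <table>...</table> block in text, tags included."""
    return TABLE_RE.findall(text)


tables = extract_tables(s)
for t in tables:
    print(t)
    print('-' * 20)
```

To run it on a real file, read the contents into a string first (for example with open(...).read()) and pass that string to extract_tables.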

Alexander Taratin, 2018-05-20
@Taraflex

Wrap all the text in <body>.
Then see https://stackoverflow.com/questions/3051295/jquery...
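The linked answer is about jQuery, but the same idea (let a parser find the elements instead of a regex) can be tried in Python too. Below is a rough sketch using html.parser from the standard library; this is a swapped-in technique, not what the linked answer shows. TableCollector and extract_tables are hypothetical names. It rebuilds each top-level table from parser events, so nested tables stay inside their parent, closing tags come out lowercased, comments inside a table are dropped, and a table that is never closed is silently skipped.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collects the markup of every top-level <table> (hypothetical helper)."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.tables = []   # finished tables, as strings
        self._depth = 0    # how many <table> elements are currently open
        self._parts = []   # pieces of the table being rebuilt

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
        if self._depth:
            # get_starttag_text() returns the start tag exactly as written.
            self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> need no separate end tag.
        if self._depth:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._depth:
            return
        self._parts.append(f"</{tag}>")
        if tag == "table":
            self._depth -= 1
            if self._depth == 0:
                self.tables.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)

    def handle_entityref(self, name):
        if self._depth:
            self._parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._depth:
            self._parts.append(f"&#{name};")


def extract_tables(text):
    parser = TableCollector()
    parser.feed(text)
    parser.close()
    return parser.tables


# Shortened sample based on the example in the question.
sample = (
    "FILED AS OF DATE:\t\t20180122\n\n"
    "<table id=1>\n    <tr>\n        <td>Some Text</td>\n    </tr>\n</table>\n\n"
    "<table id=2>\n    <tr>\n        <td>Some Text</td>\n    </tr>\n</table>\n"
)
tables = extract_tables(sample)
print(len(tables))  # 2
```

With this approach no <body> wrapper is needed, because the parser simply ignores the plain text outside the tables.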