How do I split an HTML file into multiple files?

Python

sazhyk, 2017-10-16 08:47:23

Please help me parse a static HTML file.
The file is rather unusual and violates the standards in a number of ways, but that is beside the point for now.

Here is the file:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict/EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>Parse Me</title>
    </head>
    <body>
        <div id="my_id">
            <!-- Something -->
        </div>
        <div id="my_id">
            <!-- Something -->
        </div>
        <!-- Many more <div id="my_id"> -->
        <div id="my_id">
            <!-- Something -->
        </div>
    </body>
</html>

I need to extract every block like this from the file:

<div id="my_id">
    <!-- Something -->
</div>

and then create a separate HTML file from each of them.

Like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict/EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>Parse Me</title>
    </head>
    <body>
        <div id="my_id">
            <!-- Something -->
        </div>
    </body>
</html>

I have never dealt with parsing before. I searched Google and looked through Toster. People say lxml is good for this kind of task, but I find its documentation hard to follow. Any help is welcome, whether it is advice or working code.

1 answer(s)

qlkvg, 2017-10-16
@sazhyk

bs4 (BeautifulSoup) is generally looked down on around here, but I still like it.

from bs4 import BeautifulSoup

src = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict/EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>Parse Me</title>
    </head>
    <body>
        <div id="my_id">
            <!-- Something -->
        </div>
        <div id="my_id">
            <!-- Something -->
        </div>
        <!-- Many more <div id="my_id"> -->
        <div id="my_id">
            <!-- Something -->
        </div>
    </body>
</html>
"""

template = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict/EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>Parse Me</title>
    </head>
    <body>
        {}
    </body>
</html>
"""

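# Name the parser explicitly; without it bs4 emits a warning and guesses one.
bs = BeautifulSoup(src, "html.parser")
# The id is duplicated in this file, so find_all returns every matching <div>.
# The commented-out <div> is not matched because it is parsed as a comment.
divs = bs.find_all("div", {"id": "my_id"})
for div in divs:
    # Wrap each div in the page skeleton and print the resulting document.
    print(template.format(div))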
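
The loop above only prints each page, while the question asks for separate files. A minimal sketch of the missing step follows, using only the standard library. The function name `write_pages`, the `prefix` argument, and the `part_001.html` naming scheme are assumptions made for illustration; nothing in the question specifies how the files should be named.

```python
from pathlib import Path


def write_pages(fragments, template, out_dir, prefix="part"):
    """Write one HTML file per fragment and return the created paths.

    `template` must contain a single `{}` placeholder, like the one above.
    `prefix` and the numbering scheme are arbitrary choices for this sketch.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed

    paths = []
    for number, fragment in enumerate(fragments, start=1):
        path = out_dir / f"{prefix}_{number:03d}.html"
        # str() lets the function accept bs4 Tag objects as well as strings.
        path.write_text(template.format(str(fragment)), encoding="utf-8")
        paths.append(path)
    return paths
```

With the variables from the answer, this could be called as `write_pages(divs, template, "out")`. Note that `str.format` treats every brace as a placeholder, so a template containing inline CSS or JavaScript would need its literal braces doubled.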