How to parse a page while keeping styles (headings, highlights, paragraphs, etc.)?

rkfddf, 2021-01-19 18:54:07

Python

How can I parse a web page while preserving its formatting (headings, highlights, paragraphs, etc.)? That is, I want to collect everything from a page and then transfer it to another site without additional editing, with the pictures placed in the right spots straight away. I am transferring to WordPress and collecting the content with Selenium in Python. The volume is large; can the process be simplified? I have no access to the database.

1 answer(s)

Daria Motorina, 2021-01-19
@rkfddf

wget was mentioned in the comments, and it turns out that it really is possible to download all files with wget (example 9 in the article):
https://m.habr.com/ru/company/ruvds/blog/346640/
If that is not suitable or not enough, Selenium can access any node of the DOM tree and return the text together with its markup (the styling produced by WYSIWYG editors). You can also read the src attribute values from img tags and download the files by URL. I can't give Python code examples, but I know for sure that this is possible and fairly easy to do; the main thing is to find the selectors for these elements.
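
For illustration, here is a rough Python sketch of the Selenium approach described above, not a tested migration tool. It assumes Selenium 4 with a local Chrome driver. The helper names (fetch_article_html, extract_image_urls, download_images) and whatever CSS selector you pass in are hypothetical; you would have to find the real selector of the article container on the source site.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_urls(html, base_url):
    """Return absolute URLs for all images found in the HTML fragment."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return [urljoin(base_url, src) for src in collector.sources]


def download_images(image_urls, target_dir):
    """Download each image by URL and return the list of local file paths."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for url in image_urls:
        # Fall back to a generic name if the URL path has no file name.
        name = os.path.basename(urlparse(url).path) or "image"
        path = os.path.join(target_dir, name)
        urlretrieve(url, path)
        saved.append(path)
    return saved


def fetch_article_html(page_url, css_selector):
    """Open the page in Chrome and return the inner HTML of one element.

    The inner HTML keeps the headings, paragraphs and inline markup
    exactly as they appear in the DOM. Selenium is imported here so the
    helpers above can be used without it.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(page_url)
        element = driver.find_element(By.CSS_SELECTOR, css_selector)
        return element.get_attribute("innerHTML")
    finally:
        driver.quit()
```

The idea would be to call fetch_article_html with the page URL and the selector of the content block, pass the result to extract_image_urls with the same page URL as the base, and then to download_images. The returned HTML still points at the original image URLs, so the src values would need to be rewritten once the images are uploaded to the new site.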