Scraping a site (its content) from the web archive. How?

Parsing

mRelby, 2021-06-10 17:22:25

Good day to all!

The question is really all in the title: what is the best way today to pull content (or a whole site) out of the web archive?
If anyone has experience with this, please share your tips.

Thanks in advance.

P.S. Maybe there is a Python library for this? That would be even better.

2 answers

weranda, 2021-06-12
@weranda

If you want to copy everything, the tool for that is called Wayback Machine Downloader. If you want to parse the pages, that is, take them apart, there are plenty of options, for example lxml (which, it seems, is used inside BeautifulSoup and Scrapy).
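
As a rough illustration of the "parse it yourself" route, here is a minimal Python sketch. It assumes the Wayback Machine availability endpoint (https://archive.org/wayback/available) and its JSON layout with `archived_snapshots` and `closest` keys. To stay dependency-free it uses the standard library `html.parser` rather than lxml, and the helper names (`availability_url`, `closest_snapshot`, `extract_links`, `fetch_links`) are made up for this example. Treat it as a starting point, not a finished scraper.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_url(site, timestamp=None):
    """Build the lookup URL for a site (pure function, no network access)."""
    params = {"url": site}
    if timestamp:
        params["timestamp"] = timestamp  # e.g. "20210610" (YYYYMMDD)
    return AVAILABILITY_API + "?" + urlencode(params)


def closest_snapshot(api_response):
    """Return the URL of the closest archived snapshot, or None if absent."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Parse an HTML string and return all link targets in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def fetch_links(site, timestamp=None):
    """Network part: find the snapshot, download it, return its links."""
    with urlopen(availability_url(site, timestamp)) as resp:
        snapshot = closest_snapshot(json.load(resp))
    if snapshot is None:
        return []
    with urlopen(snapshot) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return extract_links(html)
```

Calling `fetch_links("example.com", "20210610")` would then return the links found in the snapshot closest to that date, if one exists. Swapping `LinkCollector` for lxml or BeautifulSoup only changes `extract_links`; the lookup part stays the same.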

Igor, 2021-06-17
@hurgadan

As an option, https://github.com/puppeteer/puppeteer for parsing sites. Though I don't really know what you mean by "web archive".