How can you parse an HTML page in bash?

bash

Andrey Strelkov, 2020-04-13 21:19:14

Good afternoon. I am looping over a list of URLs and downloading the HTML source of each page.
From each page I need to pull out a specific piece of text that sits inside a specific container, for example:

<div class="text-container">
  <p>A paragraph that is needed</p>
  <p>Another paragraph that is needed</p>
  <aside>A container that is not needed</aside>
  <div>Another container that is not needed</div>
  <p>Yet another paragraph that is needed</p>
</div>

That is, the output should be the content of the text-container container, but only its paragraphs, i.e.:

<p>A paragraph that is needed</p>
<p>Another paragraph that is needed</p>
<p>Yet another paragraph that is needed</p>

Moreover, if a paragraph contains other nested containers, those should be excluded as well, and tags such as a, strong, etc. should be stripped.

In other words, keep only p and br, i.e. only text, paragraphs, and line breaks.

What is the correct way to do this kind of parsing in bash?

2 answer(s)

Viktor Taran, 2020-04-13
@shambler81

1. Fetch the page with wget or curl, then filter it with sed, awk, or grep.
2. Given that you are asking such a simple question, use this:
https://chrome.google.com/webstore/detail/web-scra...
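
For option 1, a minimal sketch of the curl plus awk/sed route might look like the following. It assumes the markup is laid out one element per line, as in the sample from the question, and that the container is written literally as <div class="text-container">. Real pages often break both assumptions, in which case a proper HTML parser is the safer choice. The function name extract_paragraphs and the file urls.txt are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: assumes one element per line, as in the sample markup.
# extract_paragraphs is a hypothetical helper name, not a standard tool.
extract_paragraphs() {
  local open=$'\001' close=$'\002'   # placeholders that protect <p> and <br>
  awk '
    # Enter the target container (exact class attribute assumed).
    !depth && /<div class="text-container">/ { depth = 1; next }
    depth {
      line = $0
      opens  = gsub(/<div[ >]/, "&", line)   # count nested <div> openings
      closes = gsub(/<\/div>/, "&", line)    # count </div> closings
      # Print only paragraphs that are direct children of the container.
      if (depth == 1 && $0 ~ /^[ \t]*<p[ >]/) print
      depth += opens - closes
    }
  ' | sed -E \
      -e 's/^[[:space:]]+//' \
      -e "s#<(/?(p|br))([[:space:]/][^>]*)?>#${open}\\1${close}#g" \
      -e 's/<[^>]*>//g' \
      -e "s/${open}/</g" \
      -e "s/${close}/>/g"
}

# Usage sketch (urls.txt is a hypothetical file with one URL per line):
#   while IFS= read -r url; do
#     curl -fsSL "$url" | extract_paragraphs
#   done < urls.txt
```

The awk stage tracks div nesting so that paragraphs inside nested containers are skipped. The sed stage first marks p and br tags (dropping their attributes), then deletes every other tag, then restores the marked ones.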

xotkot, 2020-04-14
@xotkot

pup, a command-line HTML processor that filters pages with CSS selectors.
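
A rough sketch of how this could be applied to the markup from the question: the tool reads HTML on stdin, so a child selector should pick out only the paragraphs directly inside the container. This assumes pup is installed and that its `>` combinator and `text{}` display function behave as described in its README. The function names and the URL below are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: requires pup on PATH. Function names are hypothetical.

# Print the <p> elements that are direct children of div.text-container.
# Nested tags such as <a> or <strong> are still present in this output.
paragraphs_pup() {
  pup 'div.text-container > p'
}

# Print only the text of those paragraphs, with all markup dropped.
paragraph_text_pup() {
  pup 'div.text-container > p text{}'
}

# Usage sketch with a hypothetical URL:
#   curl -fsSL "https://example.com/page.html" | paragraphs_pup
```

Note that text{} also discards br, so to keep p and br while stripping everything else, the output of the first function would still need a separate tag-stripping pass, for example with sed.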