bash
Andrey Strelkov, 2020-04-13 21:19:14

How can you parse an HTML page in bash?

Good afternoon. I am looping over a list of URLs and downloading the contents (HTML source code) of each one.
From each page I need to pull out a piece of text that sits inside a particular container, for example:

<div class="text-container">
  <p>A needed paragraph</p>
  <p>Another needed paragraph</p>
  <aside>An unneeded container</aside>
  <div>Another unneeded container</div>
  <p>Yet another needed paragraph</p>
</div>


That is, the output should be the content of the text-container container, but only its paragraphs, i.e.

<p>A needed paragraph</p>
<p>Another needed paragraph</p>
<p>Yet another needed paragraph</p>


Moreover, if a paragraph itself contains other containers, they should be excluded, and inline tags such as a, strong, etc. should be stripped as well.

In other words, only p and br should remain, i.e. only text, paragraphs, and line breaks.

What is the correct way to do this kind of parsing in bash?


2 answer(s)
Viktor Taran, 2020-04-13
@shambler81

1. One option: wget or curl to download the page, plus sed, awk, and grep to process it.
2. Given that you are asking such a simple question, use this:
https://chrome.google.com/webstore/detail/web-scra...
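
A minimal sketch of the first option, under some loud assumptions: the page is laid out like the sample in the question (each paragraph on a single line, the container opened with exactly `<div class="text-container">`), and the list of inline tags to strip (a, strong, em, b, i, span) is only illustrative. The function name `extract_paragraphs` is hypothetical. Regex-based HTML handling is fragile, so treat this as a starting point rather than a robust parser.

```bash
#!/usr/bin/env bash

# Hypothetical helper: reads HTML on stdin and prints the <p> lines that are
# direct children of <div class="text-container">, with inline tags removed.
extract_paragraphs() {
  awk '
    # Entering the target container: start tracking nesting depth.
    /<div class="text-container">/ { depth = 1; next }
    depth > 0 {
      line = $0
      opened = gsub(/<div[ >]/, "", line)   # nested <div> opened on this line
      closed = gsub(/<\/div>/, "", line)    # </div> closed on this line
      depth += opened - closed
      # Keep only one-line paragraphs sitting directly inside the container.
      if (depth == 1 && $0 ~ /^[ \t]*<p[ >].*<\/p>[ \t]*$/) print
    }
  ' | sed -E 's/^[[:space:]]+//; s#</?(a|strong|em|b|i|span)([[:space:]][^>]*)?>##g'
}

# Example use inside a loop over URLs (assumes a file urls.txt, one URL per line):
#   while read -r url; do curl -fsSL "$url" | extract_paragraphs; done < urls.txt
```

The awk stage selects lines by position in the container, and the sed stage removes the listed inline tags while leaving p and br intact. Paragraphs that span several lines, or tags outside the illustrative list, would need extra handling.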

xotkot, 2020-04-14
@xotkot

pup
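
For illustration, a short sketch of how pup might be applied to the sample from the question. It assumes pup is installed and on PATH; the helper names `paragraphs_html` and `paragraphs_text` are hypothetical. The selector uses the CSS child combinator so that only paragraphs directly inside the container are matched, and the `text{}` display function prints the text content without tags.

```bash
#!/usr/bin/env bash

# Hypothetical helpers; both read HTML on stdin and assume pup is installed.

# Print the <p> elements that are direct children of div.text-container.
paragraphs_html() {
  pup 'div.text-container > p'
}

# Print only the text inside those paragraphs (inline tags such as a or
# strong are dropped, their text is kept).
paragraphs_text() {
  pup 'div.text-container > p text{}'
}

# Example use inside a loop over URLs (assumes a file urls.txt, one URL per line):
#   while read -r url; do curl -fsSL "$url" | paragraphs_text; done < urls.txt
```

Note that pup reformats the markup it prints, so the HTML output will not match the input byte for byte, and keeping br while dropping other inline tags would still need a post-processing step.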
