JavaScript
Eugene, 2021-12-13 11:36:02

How to convert HTML string to array?

Our project has a wiki in English, and we want to automatically add Russian translations to it via the DeepL API.
The text is stored in the database as HTML.
The task seemed simple: pass the string to DeepL and store the translation in the database. The translator handles HTML markup well and returns a good translation in about 90% of cases. However, if the HTML contains code (a code tag) with, for example, MySQL code, the translator translates that too. So we decided to split the HTML by tags into an array.

const { JSDOM } = require('jsdom')
const dom = new JSDOM(html)
// Collect the outer HTML of each top-level element in <body>
const arr = [...dom.window.document.body.children].map((child) => child.outerHTML)

The output looks something like this:
[
  "<p>Example</p>",
  "<code class="hljs language-css">.test {text-decoration: none;}</code>",
  "<code class="hljs javascript">console.log('Test');</code>",
  "<p>Example</p>"
]

As a result, we get an array that we can iterate over, translating only what needs translating. Everything worked fine until we noticed that parts of the content inside code tags sometimes disappear. This happens especially often when the code is HTML.

The question is: can anyone suggest a more universal way to translate an HTML string that contains both text and code?

Thanks in advance!

UPD: Let me clarify the problem. There is an HTML string. For example, one I copied from StackOverflow:
<p>This is my docker-compose.yml file</p>
<pre><code>version: &quot;3.7&quot;
services: 
  db:
    build: database
    container_name: db
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DB}
    volumes:
      - ./pgdata:/var/lib/postgressql/data
      - ./database/migrations:/migrations
    ports:
      - &quot;5432:5432&quot;
</code></pre>
<p>My end objective is that it should first run the parent's entry point command and then run mine.</p>


How do I get an array like this from it:
const arr = [
  "<p>This is my docker-compose.yml file</p>",
  "<pre><code>version: &quot;3.7&quot;
services: 
  db:
    build: database
    container_name: db
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DB}
    volumes:
      - ./pgdata:/var/lib/postgressql/data
      - ./database/migrations:/migrations
    ports:
      - &quot;5432:5432&quot;
</code></pre>",
  "<p>My end objective is that it should first run the parent's entry point command and then run mine.</p>"
];


Also, if there is more HTML inside the code tag, it must not be touched, because that is exactly what is happening right now.
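For illustration, the desired split can be sketched without a DOM parser, so that markup inside a code block is never interpreted as HTML. This is only a rough sketch under stated assumptions: the names `splitHtml` and `CODE_BLOCK` and the `translate` flag are made up for this example, the regular expression is not a full HTML parser, and nested blocks of the same tag are not supported. It also splits inline `<code>` out of the surrounding paragraph, which may not be what you want.

```javascript
// Matches a whole <pre>...</pre> or <code>...</code> block, attributes included.
// The lazy [\s\S]*? stops at the first closing tag with the same name, so
// nested blocks of the same tag are NOT handled (assumption of this sketch).
const CODE_BLOCK = /<(pre|code)(\s[^>]*)?>[\s\S]*?<\/\1\s*>/gi;

// Splits an HTML string into an array of parts. Code blocks are copied
// verbatim and marked translate: false; everything between them is
// marked translate: true. Whitespace-only gaps are dropped.
function splitHtml(html) {
  const parts = [];
  let last = 0;

  for (const match of html.matchAll(CODE_BLOCK)) {
    const before = html.slice(last, match.index).trim();
    if (before !== '') {
      parts.push({ translate: true, html: before });
    }
    // The block is taken as a raw substring, so any HTML inside it is untouched.
    parts.push({ translate: false, html: match[0] });
    last = match.index + match[0].length;
  }

  const rest = html.slice(last).trim();
  if (rest !== '') {
    parts.push({ translate: true, html: rest });
  }
  return parts;
}
```

Only the parts with `translate: true` would then be sent to the translator, and joining the `html` fields in order gives back the document.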


1 answer(s)
Alexander Smirnov, 2021-12-13
@sasha-hohloma

I recommend reading up on tree data structures or on parsing a JSON string, because your task is essentially that. I can suggest the htmlparser2 library, which lets you control parsing based on the tags or attributes you need. As an experiment, you could try cutting out the <code>*</code> blocks, but the main thing is to remember their positions so that you can put them back later.
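The "cut it out and put it back" idea might look roughly like the sketch below. It uses plain string operations rather than htmlparser2, and the names `protectCode` and `restoreCode` and the `data-placeholder` attribute are invented for this example. It assumes the translator returns the empty placeholder elements unchanged, which should be checked against the real service.

```javascript
// Matches a whole <pre>...</pre> or <code>...</code> block, attributes included.
// Nested blocks of the same tag are not handled (assumption of this sketch).
const CODE_BLOCK = /<(pre|code)(\s[^>]*)?>[\s\S]*?<\/\1\s*>/gi;
const PLACEHOLDER = /<code data-placeholder="(\d+)"><\/code>/g;

// Replaces every code block with an empty numbered placeholder and
// returns the stripped text together with the removed blocks.
function protectCode(html) {
  const blocks = [];
  const text = html.replace(CODE_BLOCK, (block) => {
    blocks.push(block);
    return `<code data-placeholder="${blocks.length - 1}"></code>`;
  });
  return { text, blocks };
}

// Puts the original blocks back in place of their placeholders.
// A replacer function is used so that "$" sequences inside the code
// are inserted literally instead of being treated as replacement patterns.
function restoreCode(translated, blocks) {
  return translated.replace(PLACEHOLDER, (_, index) => blocks[Number(index)]);
}
```

The string returned in `text` is what would go to the translator; calling `restoreCode` on the translated result with the saved `blocks` puts the untouched code back at the same positions.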
