HTML
Cat Anton, 2018-02-04 23:34:04

What are the patterns or algorithms for text processing?

I'm going to write a library that will perform a large number of different transformations on HTML text:
1. Convert the text encoding to UTF-8 and normalize the Unicode;
2. Wrap bare links in <a>...</a> tags (if the link is not already inside one);
3. Insert missing closing HTML tags;
4. Place non-breaking spaces after prepositions;
5. Replace hyphens with dashes where appropriate;
6. Remove forbidden HTML tags;
7. Replace certain character combinations with images (for emoticons);
...and hundreds of other transformations.
As you can see, some transformations are carried out at the level of the entire text, others at the level of individual HTML tags, and still others at the level of individual characters. For some rules, the order in which they are applied is important: some rules must be executed before others.
The result should be a cross between a typography tool and the well-known HTML Purifier. The code should be easy to extend (adding custom transformations) with third-party plugins.
The simplest solution is to perform each transformation separately, in the right order: in some cases, find and replace the necessary characters in the text (example from Muravyov's typographer), in others, use regular expressions (example). That is, each rule processes the entire text in one way or another. On the one hand, such code is easy to understand and maintain, but it is hard to extend, and with a large number of transformations everything runs very slowly. Worst of all, with this approach a change to one transformation can affect others, and you have to split the transformations into separate stages.
How should such a library be architected? Are there patterns for this?
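One common answer is a pipeline (chain of responsibility) of transformation rules with explicit priorities, so the required ordering between rules is stated in one place instead of being implicit in call order. A minimal Python sketch; the class and method names here are made up for illustration, not an existing library:

```python
from typing import Callable, List, Tuple


class TransformPipeline:
    """Ordered registry of text transformations (hypothetical API).

    Each rule is a plain callable str -> str; rules run in ascending
    priority order, so dependent rules can be sequenced explicitly.
    """

    def __init__(self) -> None:
        self._rules: List[Tuple[int, Callable[[str], str]]] = []

    def register(self, rule: Callable[[str], str], priority: int = 100) -> None:
        self._rules.append((priority, rule))
        self._rules.sort(key=lambda pair: pair[0])

    def run(self, text: str) -> str:
        for _, rule in self._rules:
            text = rule(text)
        return text


pipeline = TransformPipeline()
# Two toy rules from the question: hyphen -> em dash, emoticon -> image.
pipeline.register(lambda t: t.replace(" - ", " \u2014 "), priority=50)
pipeline.register(lambda t: t.replace(":)", "<img alt=':)'>"), priority=60)

print(pipeline.run("Hello - world :)"))  # Hello — world <img alt=':)'>
```

Third-party plugins would simply call register() with a priority; the downside noted in the question remains (every rule walks the whole text), which is why the tree-based answers below split parsing from transformation.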


4 answers
D3lphi, 2018-02-05
@D3lphi

I will suggest the following procedure:

  1. The lexical analyzer breaks the incoming character stream into tokens.
  2. The token stream is then fed to the parser, which builds a "raw" DOM tree that may contain invalid tags and so on.
  3. The raw DOM tree is traversed and normalized: tag names are corrected to the closest valid ones, missing tags are inserted, forbidden ones are removed, etc.
  4. The normalized tree can then be traversed to transform the text.
  5. The finished tree is serialized back into an HTML document.
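The steps above can be sketched with Python's standard html.parser, which performs the tokenizing (step 1) itself. The FORBIDDEN and VOID sets are illustrative assumptions, not a complete list, and attributes are dropped during serialization for brevity:

```python
from html.parser import HTMLParser

FORBIDDEN = {"script"}      # assumption: tags to strip entirely
VOID = {"br", "img", "hr"}  # tags that take no closing pair


class TreeBuilder(HTMLParser):
    """Builds a 'raw' tree of (tag, attrs, children) nodes; text is str."""

    def __init__(self):
        super().__init__()
        self.root = ("#root", [], [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, attrs, [])
        self.stack[-1][2].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close up to the matching open tag; ignore stray closing tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1][2].append(data)


def serialize(node):
    tag, _attrs, children = node
    inner = "".join(
        serialize(c) if isinstance(c, tuple) else c
        for c in children
        if not (isinstance(c, tuple) and c[0] in FORBIDDEN)
    )
    if tag == "#root":
        return inner
    if tag in VOID:
        return f"<{tag}>"
    return f"<{tag}>{inner}</{tag}>"  # missing closers are added here


def normalize(html):
    builder = TreeBuilder()
    builder.feed(html)
    return serialize(builder.root)


# Unclosed <p> and <b> get closed, <script> is removed.
print(normalize("<p>hello <b>world<script>evil()</script>"))
```

A real implementation would keep attributes, handle nesting rules per the HTML spec, and so on; the point is only that "insert missing tags" and "remove forbidden tags" fall out naturally once a tree exists.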

asd111, 2018-02-05
@asd111

Judging by what you want to do, I would advise you to learn ANTLR. It is a very high-quality generator of lexers and parsers; it is written in Java, but that is not a problem.
There is already a ready-made grammar for HTML. If you don't like it, you can write your own.
From this grammar it builds a tree, and to traverse the tree you can quickly write a visitor or something simpler. Error handling (a missing closing tag, etc.) is also easy to add there.
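For the traversal part, the visitor pattern looks roughly like this. This is a hand-rolled Python sketch over a simple nested-tuple tree; ANTLR's generated visitors follow the same shape, with one visit method per grammar rule, but their actual class names come from your grammar:

```python
class NodeVisitor:
    """Dispatches to visit_<tag> if defined, else to generic_visit."""

    def visit(self, node):
        tag, _children = node
        method = getattr(self, f"visit_{tag}", self.generic_visit)
        return method(node)

    def generic_visit(self, node):
        _tag, children = node
        for child in children:
            if isinstance(child, tuple):
                self.visit(child)


class TextCollector(NodeVisitor):
    """Example visitor that gathers all text leaves of the tree."""

    def __init__(self):
        self.texts = []

    def generic_visit(self, node):
        _tag, children = node
        for child in children:
            if isinstance(child, tuple):
                self.visit(child)
            else:
                self.texts.append(child)


tree = ("html", [("body", [("p", ["hello ", ("b", ["world"])])])])
collector = TextCollector()
collector.visit(tree)
print(collector.texts)  # ['hello ', 'world']
```

Per-tag transformations (wrapping links, placing non-breaking spaces, etc.) would each be one such visitor, or one visit_<tag> method on a shared visitor.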

RuWeb, 2019-04-13
@RuWeb

There is a ready-made online service: TextTools.ru

xmoonlight, 2018-02-04
@xmoonlight

1. First, build a correct/valid DOM tree.
2. Then add all the required tags.
3. Remove all forbidden tags.
4. Then recursively traverse all branches of the DOM tree, performing the text transformations.
The general order (strictly observing the sequence of actions!):
validate the structure, add the necessary elements, remove forbidden elements, then modify the "body" of the remaining elements.
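That strict ordering can be enforced by making the phases an explicit list that plugins hook into, rather than letting handlers run in registration order. A small Python sketch; the phase names mirror the steps above and are otherwise arbitrary:

```python
# Fixed phase order: plugins may add handlers to a phase,
# but can never reorder the phases themselves.
PHASES = ["validate", "add_required", "remove_forbidden", "transform_text"]


def run_phases(tree, handlers):
    """Apply handlers phase by phase; handlers maps phase -> [fn]."""
    for phase in PHASES:
        for fn in handlers.get(phase, []):
            tree = fn(tree)
    return tree


# Demo: handlers registered "out of order" still run in phase order.
log = []
handlers = {
    "remove_forbidden": [lambda t: (log.append("remove"), t)[1]],
    "validate": [lambda t: (log.append("validate"), t)[1]],
}
run_phases({}, handlers)
print(log)  # ['validate', 'remove']
```

This keeps the "some rules must run before others" requirement from the question out of the individual rules and in one central place.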
