Text Processing Automation
alekssamos, 2021-01-31 22:09:08

How do I normalize text written in a mixed alphabet?

I don't know what other tags to add.

A friend of mine gets texts at work that are written in a mixed alphabet (Russian and Latin letters): the Latin "Y" instead of the Cyrillic "У", the Latin "C" instead of the Cyrillic "С", and so on. Has anyone run into this problem and solved it?
Is it possible to build a dictionary of the different combinations, perhaps even with regular expressions? I tried once, but the result still came out wrong. Does anyone have a solution?
Additionally
We are blind, both she and I. We use a talking screen reader with a speech synthesizer, a voice that reads text aloud. A Braille display can also be used; in Braille, Cyrillic and Latin letters are written differently, and such substitutions within a single word are unacceptable. So: accuracy of information and processing speed matter to her, and even a one-letter error is undesirable. It also distracts from work, breaks her train of thought, and is simply annoying. I don't know the details. She has tried switching synthesizers, but none of them suits her because of this. What matters is reading the text in real time rather than copying it into an editor, replacing characters, and so on; but if that is impossible, then that is the only way out.


3 answer(s)
Developer, 2021-02-01
@samodum

I solved this problem, but a long time ago, more than 10 years back.
Here is a link to my article on Habr: https://habr.com/ru/post/86303/
We assume that Cyrillic and Latin cannot be mixed within one word: a word must consist of either only Cyrillic or only Latin letters. If the alphabets are mixed, the word has to be converted to the correct alphabet.
The idea is simple: the program tries to determine the language a word is written in by looking for unambiguously Russian letters, such as Е, Ж, З, Ф, Я, etc., and likewise for English: F, L, Q, S, V, W, Z, etc.
After that, all ambiguous letters (those that look alike in both alphabets: A, O, E, У/Y, Х/X ...) are forcibly replaced in the word with the corresponding letters of the language that was detected.
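To make the idea concrete, here is a minimal Python sketch. It is not the code from the article; the look-alike table and the names `normalize_word`, `CYR_TO_LAT`, and `LAT_TO_CYR` are my own assumptions, and the table is deliberately incomplete.

```python
# Look-alike pairs, Cyrillic -> Latin. Illustrative and incomplete (an assumption).
CYR_TO_LAT = dict(zip("АВЕКМНОРСТУХаеорсух", "ABEKMHOPCTYXaeopcyx"))
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}


def is_cyrillic(ch):
    # Basic Cyrillic Unicode block.
    return "\u0400" <= ch <= "\u04ff"


def is_latin(ch):
    return ch.isascii() and ch.isalpha()


def normalize_word(word):
    """Detect the word's alphabet from unambiguous letters, then convert look-alikes."""
    # Only letters that have no look-alike in the other alphabet count as evidence.
    cyr_votes = sum(1 for ch in word if is_cyrillic(ch) and ch not in CYR_TO_LAT)
    lat_votes = sum(1 for ch in word if is_latin(ch) and ch not in LAT_TO_CYR)
    if cyr_votes > lat_votes:
        table = LAT_TO_CYR   # the word is Russian: replace Latin look-alikes
    elif lat_votes > cyr_votes:
        table = CYR_TO_LAT   # the word is English: replace Cyrillic look-alikes
    else:
        return word          # no evidence either way: leave the word untouched
    return "".join(table.get(ch, ch) for ch in word)
```

A word made up only of ambiguous letters gives no votes, so this sketch leaves it unchanged; deciding such cases needs extra information, for example a dictionary.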
You can also go the other way: convert the word entirely to Cyrillic, then entirely to Latin, and look each variant up in a dictionary. If one of them is found there, use that word. My algorithm would need to be refined for that; I'll get around to it at some point.
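A rough Python sketch of this dictionary variant, under my own assumptions: `fix_by_dictionary` is a hypothetical name, the look-alike table is incomplete, and `known_words` stands in for a real word list that would be loaded from a file.

```python
# Look-alike pairs, Cyrillic -> Latin. Illustrative and incomplete (an assumption).
CYR_TO_LAT = dict(zip("АВЕКМНОРСТУХаеорсух", "ABEKMHOPCTYXaeopcyx"))
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}


def fix_by_dictionary(word, known_words):
    """Try the all-Cyrillic and all-Latin readings; keep the one the dictionary knows."""
    as_cyrillic = "".join(LAT_TO_CYR.get(ch, ch) for ch in word)
    as_latin = "".join(CYR_TO_LAT.get(ch, ch) for ch in word)
    for candidate in (as_cyrillic, as_latin):
        # known_words is assumed to hold lowercase entries.
        if candidate.lower() in known_words:
            return candidate
    return word  # neither reading is a known word: leave it as it was
```

The Cyrillic reading is tried first here, which is an arbitrary choice; for mostly English texts the order could be reversed.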
I hope I explained clearly.

alekssamos, 2021-02-14
@alekssamos

The code from this article helped. I made an add-on called textnormalizer.

Stalker_RED, 2021-02-01
@Stalker_RED

The easiest option is to replace all Latin characters with the Cyrillic characters that look like them. But this method has a significant drawback: it will also replace those letters in legitimate words written in Latin.
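As a sketch of this simplest option in Python (the look-alike table is my assumption and incomplete; `latin_to_cyrillic` is a hypothetical name):

```python
# Latin look-alikes -> Cyrillic. Illustrative and incomplete (an assumption).
LAT_TO_CYR = str.maketrans("ABEKMHOPCTYXaeopcyx", "АВЕКМНОРСТУХаеорсух")


def latin_to_cyrillic(text):
    """Blindly replace every Latin look-alike letter with its Cyrillic twin."""
    return text.translate(LAT_TO_CYR)
```

The drawback shows up immediately: an ordinary English word such as "cop" consists entirely of look-alike letters, so it silently turns into three Cyrillic letters and is no longer the same string.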
A more difficult option is to find words in which Cyrillic and Latin are mixed, and apply the replacement only to them.
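Finding the mixed words could look roughly like this in Python. The names `find_mixed_words` and `replace_mixed` are hypothetical, and treating the whole `\u0400-\u04FF` range as Cyrillic is a simplifying assumption.

```python
import re

# A "word" here is any run of basic Latin or Cyrillic letters (an assumption).
WORD = re.compile(r"[A-Za-z\u0400-\u04FF]+")
LATIN = re.compile(r"[A-Za-z]")
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def is_mixed(word):
    """True if the word contains both Latin and Cyrillic letters."""
    return bool(LATIN.search(word)) and bool(CYRILLIC.search(word))


def find_mixed_words(text):
    return [w for w in WORD.findall(text) if is_mixed(w)]


def replace_mixed(text, fix):
    """Apply fix() to mixed words only; everything else is left untouched."""
    return WORD.sub(
        lambda m: fix(m.group()) if is_mixed(m.group()) else m.group(), text
    )
```

Here `fix` is whatever correction function you prefer, so ordinary English and Russian words pass through unchanged.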
With a dictionary you can do even better: when replacing, check the word and its word forms against the dictionary, and if it is not found, show a warning or the original spelling in brackets, for example, or whatever is more convenient for you.
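A minimal Python sketch of the "original spelling in brackets" idea. `fix_or_flag` is a hypothetical name, `known_words` stands in for a real dictionary, and checking word forms is left out because it would need a morphological dictionary or analyzer.

```python
# Latin look-alikes -> Cyrillic. Illustrative and incomplete (an assumption).
LAT_TO_CYR = dict(zip("ABEKMHOPCTYXaeopcyx", "АВЕКМНОРСТУХаеорсух"))


def is_mixed(word):
    has_cyrillic = any("\u0400" <= ch <= "\u04ff" for ch in word)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in word)
    return has_cyrillic and has_latin


def fix_or_flag(word, known_words):
    """Fix a mixed word; if the result is not a known word, keep the original in brackets."""
    if not is_mixed(word):
        return word
    fixed = "".join(LAT_TO_CYR.get(ch, ch) for ch in word)
    # known_words is assumed to hold lowercase entries.
    if fixed.lower() in known_words:
        return fixed
    return f"{fixed} [{word}]"
```

For a screen reader the bracketed original may be more noise than help, so a different marker, or no marker at all, could be a better fit.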
If the text is read in a browser, you can write an extension or a user script. If it comes from an editor such as Microsoft Word, you can write VBA scripts there. And surely some screen readers have a plugin API.
