How to implement recognition of matching Russian and English characters?


Text recognition

WTFRU7, 2014-02-22 14:15:31

I have started learning OCR and ran into the following question: what algorithms exist for telling Russian characters apart from English ones? It is no secret that the two alphabets share characters that look identical:
habrahabr - English letters
An excellent resource - Russian letters (a grammatical error was made on purpose)
In both cases there is a letter "a" that can be recognized as either English or Russian. If the recognized text is later rendered in a font where these two letters differ (a handwriting-style font, for example), the mistake will be plainly visible to the user.
How can this be resolved? Does anyone know of suitable algorithms?
So far the following comes to mind: keep glyph images for all Russian letters, and remove from the English glyph images everything that matches a Russian one. Then, once a word has been recognized, check whether it contains English letters; if it does, replace every Russian letter with its English counterpart. Take the word hAbrAhAbr (capitals mark the letters that were recognized as Russian in this case). We see English letters in the word, so it is an English word, and we replace each A with the corresponding English character. But what happens if the word is genuinely mixed, such as the company name boyarinъ? Clearly "boyarin" should be written in English letters while the hard sign (ъ) is Russian, so my algorithm no longer works.
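The "check whether it contains English letters" step described above could be sketched roughly as follows. This is a minimal Python sketch, not a reference implementation: the function name scripts_in_word is hypothetical, and it assumes the OCR stage already emits Unicode code points, so that a Cyrillic а and a Latin a arrive as different characters.

```python
import unicodedata


def scripts_in_word(word):
    """Return the set of scripts ('LATIN', 'CYRILLIC') used by the letters of a word."""
    scripts = set()
    for ch in word:
        if not ch.isalpha():
            continue  # skip digits, punctuation, hyphens
        name = unicodedata.name(ch, "")  # e.g. 'CYRILLIC SMALL LETTER A'
        if name.startswith("LATIN"):
            scripts.add("LATIN")
        elif name.startswith("CYRILLIC"):
            scripts.add("CYRILLIC")
    return scripts


def is_mixed(word):
    """True when a word contains both Latin and Cyrillic letters."""
    return scripts_in_word(word) == {"LATIN", "CYRILLIC"}


# 'h' + Cyrillic 'а' + 'br': the OCR picked the Russian glyph for the second letter
suspicious = "h\u0430br"
print(is_mixed(suspicious))
```

A check like this only flags a word as mixed. It does not say which letters were misrecognized, which is exactly the difficulty with a word such as boyarinъ.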


3 answer(s)

Yuri Lobanov, 2014-02-22
@WTFRU7

Besides, how would your algorithm recognize words like these (Latin on the left, Cyrillic on the right)?
a а
on оп
no по
c (a lone letter c in a sentence) с
moon тооп
The examples are so-so, but with your approach some words will come out consisting entirely of Russian letters, so you will have to look at the context whether you like it or not :)
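One possible way to "look at the context" is to let a word that consists only of lookalike letters inherit the script of its nearest unambiguous neighbour. The sketch below is an assumption about what such a rule might look like, not something proposed in this thread; the names word_script and resolve_by_context and the lookalike table are hypothetical. The table covers only a few lowercase print-font pairs; for handwriting-style fonts it would need more pairs.

```python
# Lookalike letters: Latin a e o p c y x and Cyrillic а е о р с у х (illustrative, incomplete)
LOOKALIKES = set("aeopcyx" + "\u0430\u0435\u043e\u0440\u0441\u0443\u0445")


def word_script(word):
    """Return 'ru' or 'en' based on the first unambiguous letter, or None if there is none."""
    for ch in word.lower():
        if ch in LOOKALIKES:
            continue  # this letter proves nothing
        if "\u0400" <= ch <= "\u04ff":  # Cyrillic Unicode block
            return "ru"
        if ch.isascii() and ch.isalpha():
            return "en"
    return None


def resolve_by_context(words):
    """Give each ambiguous word the script of its nearest unambiguous neighbour."""
    own = [word_script(w) for w in words]
    resolved = []
    for i, script in enumerate(own):
        distance = 1
        while script is None and distance < len(words):
            # the left neighbour is checked first, then the right one
            for j in (i - distance, i + distance):
                if 0 <= j < len(words) and own[j] is not None:
                    script = own[j]
                    break
            distance += 1
        resolved.append(script)
    return resolved


print(resolve_by_context(["this", "a", "word"]))
```

A sentence made up only of ambiguous words stays unresolved (None), so a manual language choice would still be needed as a fallback.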

Yuri Lobanov, 2014-02-22
@iiil

For the exceptions you mention, add a recognition-language setting so that Russian or English can be forced.
In the boyarin example, though, I would do this: if the word contains English characters, replace with English letters only those Russian letters that look like English ones. The hard sign will then be left alone.
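A rough sketch of this rule, with the forced-language switch included, might look like the following. The function name normalize, the force parameter and the lookalike table are all hypothetical, and the table lists only a few lowercase pairs for illustration.

```python
LAT = "aeopcyx"
CYR = "\u0430\u0435\u043e\u0440\u0441\u0443\u0445"  # а е о р с у х
TO_LAT = str.maketrans(CYR, LAT)
TO_CYR = str.maketrans(LAT, CYR)


def has_latin(word):
    """True if the word contains at least one basic Latin letter."""
    return any("a" <= ch <= "z" or "A" <= ch <= "Z" for ch in word)


def normalize(word, force=None):
    """Unify lookalike letters; force may be None (automatic), 'en' or 'ru'."""
    if force == "ru":
        return word.translate(TO_CYR)
    if force == "en" or has_latin(word):
        # Only lookalikes are replaced; letters such as ъ have no entry
        # in the table and therefore stay Cyrillic.
        return word.translate(TO_LAT)
    return word


# 'boy' + Cyrillic 'а' + 'rin' + 'ъ': the а becomes Latin, the hard sign survives
print(normalize("boy\u0430rin\u044a"))
```

The forced mode is meant for words made up entirely of lookalikes, where the automatic rule has nothing to go on.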

Developer, 2014-03-17
@samodum

Post-processing is required.
See the article habrahabr.ru/post/86303
The essence of the algorithm: if a word contains characters that exist only in Russian (ф, д, ю, ...), the whole word is considered Russian, and every о, а, е in it is treated as Russian.
The same applies to English.
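As an illustration of this kind of post-processing, here is a minimal sketch. It is not the code from the linked article: the name fix_word and the lookalike table are hypothetical, and the majority vote used when a word contains distinctive letters of both alphabets is an assumption added here.

```python
LAT = "aeopcyx"
CYR = "\u0430\u0435\u043e\u0440\u0441\u0443\u0445"  # а е о р с у х
TO_LAT = str.maketrans(CYR, LAT)
TO_CYR = str.maketrans(LAT, CYR)


def is_distinct_cyrillic(ch):
    """A Cyrillic letter with no Latin lookalike in the table (ф, д, ю, ...)."""
    return "\u0400" <= ch <= "\u04ff" and ch not in CYR


def is_distinct_latin(ch):
    """A Latin letter with no Cyrillic lookalike in the table (h, b, r, ...)."""
    return ch.isascii() and ch.isalpha() and ch not in LAT


def fix_word(word):
    """Rewrite lookalike letters to match the distinctive letters of the word."""
    low = word.lower()
    cyr = sum(is_distinct_cyrillic(ch) for ch in low)
    lat = sum(is_distinct_latin(ch) for ch in low)
    if cyr > lat:
        return word.translate(TO_CYR)
    if lat > cyr:
        return word.translate(TO_LAT)
    return word  # no evidence either way (or a tie): leave the word alone


# 'д' + Latin 'o' + 'м': the distinctive letters are Russian, so the o becomes Russian
print(fix_word("\u0434o\u043c"))
```

Words with no distinctive letters at all are returned unchanged, so they still need either context or a manual language setting.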