Determining the language of the text

English learning tools

Maxim, 2010-11-30 09:41:20

Initial data: there are hundreds of thousands of small texts written in all languages known to science.
Goal: keep only the texts written in Russian or English and discard the rest.

What I do now:
1. Using PCRE, I strip everything from the text except letters (\p{^L}).
2. I then also remove the Russian and English letters ([a-zа-я]).
3. If anything is left, I consider the text to be neither Russian nor English and discard it.
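The three steps above can be sketched roughly as follows. This is a Python approximation, not the original PCRE code: `str.isalpha()` stands in for the \p{L} test, the function names are made up for illustration, and adding ё to the allowed set is an assumption (the [a-zа-я] range does not cover it).

```python
# Allowed letters: basic Latin a-z plus Russian а-я.
# Assumption: "ё" is added by hand because it lies outside the а-я range.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz")
ALLOWED |= {chr(code) for code in range(ord("а"), ord("я") + 1)}
ALLOWED.add("ё")


def leftover_letters(text: str) -> str:
    """Return the letters that are neither basic Latin nor Russian."""
    # Step 1: keep letters only (rough stand-in for stripping \p{^L}).
    letters = (ch for ch in text.lower() if ch.isalpha())
    # Step 2: drop the Russian and English letters.
    return "".join(ch for ch in letters if ch not in ALLOWED)


def looks_russian_or_english(text: str) -> bool:
    """Step 3: the text passes only if nothing is left over."""
    return leftover_letters(text) == ""
```

Both failure modes from the question show up immediately: "Das ist ein Haus" passes because it has no diacritics, and any text mentioning "Übermensch" is rejected because of a single ü.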

With this approach I get both false positives and false negatives, which is frustrating.
First, a German or French text may, by unlucky coincidence, contain not a single umlaut or accented letter, and it will be taken for English.
Second, a perfectly good Russian or English text may contain, say, a proper name with an umlaut or a quotation in another language, and it will be wrongly discarded.

Question: setting aside 100% language recognition (let's leave that to expert systems and other AI), is it possible to reduce the number of recognition errors? I am interested in ready-made libraries (PHP, Perl), public web services, or an algorithm that is reasonably simple to implement.

8 answer(s)

Alexxander, 2010-11-30
@Alexxander

1. For small texts, 100% recognition is impossible in principle.
2. To improve recognition, you need to build an expert system with a database of words and their frequencies in different languages.
It may also be possible to use Google Translate through its API, or something similar.
An overview of language identifiers is available here. Some of them may have an API.

GeniyZ, 2010-11-30
@GeniyZ

You can compare the frequency characteristics of texts.
www.statsoft.ru/home/portal/exchange/textanalysis.htm
As you can see, the same letters are used with different frequencies in different languages, and this can be used to improve language recognition and to separate the seemingly inseparable =) (given a sufficient amount of text, of course).

lugansk, 2010-11-30
@lugansk

>> 1. Using PCRE, I remove everything from the text except letters (\p{^L}).
>> 2. I also remove Russian and English letters ([a-zа-я]).
>> 3. If something is left, I consider the text to be neither Russian nor English, and discard it accordingly.
An English text may contain borrowed words that keep their original spelling, such as café or Übermensch. In addition, text in a language that uses the Latin alphabet may be typed without diacritics if it was typed on a computer with only the English layout installed.
For each language you need, make a list of function words, pronouns and so on that are common in it and not used in the other languages, and check for their presence in the text.
Articles, pronouns and auxiliary verbs work well for Italian, German and French. For German, for example: ein, eine, eines, einem ... der, die, das, dem, den ... bin, bist, ist, war, wurde ... ich, er, sie ..., and also the prefix ge- combined with the ending -en or -t. Just don't trust a single matched word 100%: bin (German for "am") also exists in English ("recycle bin"). It is probably fun for Germans to learn English in general: compare the meanings of the words mist, after and gift in the two languages.
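A minimal sketch of the function-word check, in Python. The German list is taken from the examples above; the English, French and Russian lists, the function names and the `min_hits` threshold are illustrative assumptions and would need to be built and tuned on real data.

```python
import re

# Tiny illustrative word lists; real ones should be much longer.
FUNCTION_WORDS = {
    "en": {"the", "is", "a", "and", "of", "to", "in", "that", "it", "was"},
    "de": {"der", "die", "das", "dem", "den", "ein", "eine", "bin", "ist", "ich", "nicht"},
    "fr": {"le", "la", "les", "un", "une", "est", "et", "je", "que", "dans"},
    "ru": {"и", "в", "не", "на", "что", "я", "он", "с", "это", "как"},
}


def score_languages(text: str) -> dict:
    """Count how many words of the text appear in each language's list."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    scores = {lang: 0 for lang in FUNCTION_WORDS}
    for word in words:
        for lang, vocabulary in FUNCTION_WORDS.items():
            if word in vocabulary:
                scores[lang] += 1
    return scores


def guess_language(text: str, min_hits: int = 2):
    """Return the best-scoring language, or None if the evidence is too thin."""
    scores = score_languages(text)
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None
```

Requiring at least two hits is one simple way to avoid trusting a single word: "recycle bin" matches the German list once and is still left undecided.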
You can also raise the probability by looking for letter combinations typical of a given language (for German: sch, ei ...). To detect Ukrainian, besides the presence of є, ї, і, ґ and the absence of ы and ъ, you can look for і used as a conjunction.
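The Ukrainian hints could be combined into a simple score, for example like this. It is only a sketch: the function name and the idea of counting signals are assumptions, and the letters are written as escapes so that Cyrillic і is not confused with Latin i.

```python
import re

UKRAINIAN_LETTERS = "\u0454\u0457\u0456\u0491"  # є, ї, і, ґ
RUSSIAN_LETTERS = "\u044b\u044a"                # ы, ъ
# Cyrillic "і" standing alone as a word (the conjunction "and").
STANDALONE_I = re.compile(r"(?<!\w)\u0456(?!\w)")


def ukrainian_score(text: str) -> int:
    """Count how many of the three hints point to Ukrainian (0 to 3)."""
    lowered = text.lower()
    signals = [
        any(ch in UKRAINIAN_LETTERS for ch in lowered),    # has є, ї, і or ґ
        not any(ch in RUSSIAN_LETTERS for ch in lowered),  # has no ы or ъ
        STANDALONE_I.search(lowered) is not None,          # "і" as a conjunction
    ]
    return sum(signals)
```

A short text may score 2 simply because it happens to lack the conjunction, so the score is better treated as one piece of evidence than as a verdict.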
If there are few languages, then it is easy to collect data on their features.
You can also experiment with the Google Language API (example).
Finally, you can google "language identifier"; maybe something ready-made will fit.

barmaley_exe, 2010-11-30
@barmaley_exe

What if you count character frequencies? That is, calculate the percentage of Russian letters, English letters and all others?
This will not save you from French without accents, but it will help with Russian text.
Then check for English: look in the text for the words the, is, a (maybe some other frequent ones too). I don't know whether they occur in other languages, but in an English text they should be present.
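The percentage idea might look like this in Python. The function names and the 0.9 threshold are assumptions chosen for illustration, and "english" here only means basic Latin a-z, so it cannot tell English from unaccented French or German by itself.

```python
def script_shares(text: str) -> dict:
    """Return the share of Russian, basic Latin and other letters in the text."""
    counts = {"russian": 0, "english": 0, "other": 0}
    for ch in text.lower():
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        if "а" <= ch <= "я" or ch == "ё":
            counts["russian"] += 1
        elif "a" <= ch <= "z":
            counts["english"] += 1
        else:
            counts["other"] += 1
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: n / total for name, n in counts.items()}


def dominant_script(text: str, threshold: float = 0.9):
    """Return "russian" or "english" if that share reaches the threshold."""
    shares = script_shares(text)
    for name in ("russian", "english"):
        if shares[name] >= threshold:
            return name
    return None
```

Unlike the strict "nothing may be left over" rule, a threshold tolerates a few stray letters, so a long Russian text with one foreign name still counts as Russian.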

Mikhail Rozhkov, 2010-11-30
@shogunkub

As for proper names, you need to analyze not only the frequency of characters but also where they are located. That is, if some letters remain besides the Russian and English ones, we look at their positions in the text. If they sit close together and their share is below the "trigger threshold" (the coefficients for "close together" and for the "trigger threshold" are chosen experimentally), we ignore these extraneous letters when determining the language.
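One possible reading of this rule, sketched in Python. "Close together" is interpreted here as "all foreign letters fit into a window of `max_span` characters"; that interpretation, both default values and the function names are assumptions to be tuned experimentally.

```python
def is_expected(ch: str) -> bool:
    """True for basic Latin and Russian letters (lowercase input)."""
    return "a" <= ch <= "z" or "а" <= ch <= "я" or ch == "ё"


def can_ignore_foreign(text: str, max_share: float = 0.05, max_span: int = 30) -> bool:
    """Decide whether the foreign letters look like an isolated name or quote."""
    lowered = text.lower()
    letter_count = sum(1 for ch in lowered if ch.isalpha())
    positions = [
        i for i, ch in enumerate(lowered)
        if ch.isalpha() and not is_expected(ch)
    ]
    if not positions:
        return True  # nothing foreign to worry about
    share = len(positions) / letter_count
    span = positions[-1] - positions[0] + 1  # window covering all foreign letters
    return share <= max_share and span <= max_span
```

With these defaults a sentence that mentions "Übermensch" once is kept, while a text whose accented letters are frequent or scattered all over is not.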

Dmitry Guketlev, 2010-11-30
@Yavanosta

Take a look at this topic; I think the idea can be adapted quite easily to determining the language.
habrahabr.ru/blogs/php/107945/

Atrax, 2010-11-30
@Atrax

As a wild idea: what if we run a frequency analysis of characters for different languages, and then compare the frequency analysis of the sample with the "profile" of each language? It seems to me that this way it would even be possible to distinguish Russian from Bulgarian. It's just a matter of statistical comparison. Perhaps all this is nonsense, but you can try; as the saying goes, "the autopsy will show."
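A rough sketch of such a profile comparison in Python. Cosine similarity is just one possible statistical measure, chosen here as an assumption, and the function names are made up; real profiles would have to be built from large reference corpora per language rather than from one sentence each.

```python
import math
from collections import Counter


def letter_profile(text: str) -> dict:
    """Relative frequency of each letter in the text."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def cosine_similarity(p: dict, q: dict) -> float:
    """Cosine of the angle between two frequency profiles (0.0 to 1.0)."""
    dot = sum(freq * q.get(ch, 0.0) for ch, freq in p.items())
    norm_p = math.sqrt(sum(freq * freq for freq in p.values()))
    norm_q = math.sqrt(sum(freq * freq for freq in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def closest_language(text: str, profiles: dict) -> str:
    """Return the language whose profile is most similar to the text."""
    sample = letter_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(sample, profiles[lang]))


# Toy reference profiles built from a single sentence each.
profiles = {
    "en": letter_profile("the quick brown fox jumps over the lazy dog and runs away"),
    "ru": letter_profile("съешь ещё этих мягких французских булок да выпей чаю"),
}
```

Telling languages with the same alphabet apart (Russian and Bulgarian, English and German) needs much larger reference texts and longer samples, and profiles over letter pairs or triples are likely to separate them better than single letters.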

Atrax, 2010-11-30
@Atrax

Well… in fact, I was not aware of it :) Reinventing the wheel is the fate of any PHP developer…