natural language processing
zooh, 2011-09-11 12:24:53

Recognition of meaningful text?

Good afternoon!
A question has come up: are there known methods for distinguishing a random string of characters from more or less literary text in a given language? Which direction should I dig in? So far, my only idea is to collect statistics on a corpus of "live" texts (the frequencies of single characters, pairs, and triples) and then compute the Pearson correlation coefficient. If you are good at mathematical statistics, can you suggest more advanced methods of analysis?
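
The idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names (`ngram_freqs`, `pearson`, `similarity`) are made up for this example, the n-gram length defaults to 2, and n-grams missing from one of the texts are counted with frequency 0.

```python
from collections import Counter
from math import sqrt


def ngram_freqs(text, n):
    """Relative frequencies of character n-grams in the text."""
    text = text.lower()
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Return 0.0 when one of the sequences is constant (correlation undefined).
    return cov / (dev_x * dev_y) if dev_x and dev_y else 0.0


def similarity(reference, candidate, n=2):
    """Correlate the n-gram frequency profiles of two texts."""
    ref = ngram_freqs(reference, n)
    cand = ngram_freqs(candidate, n)
    # Assumption: an n-gram absent from one text has frequency 0 there.
    keys = sorted(set(ref) | set(cand))
    if len(keys) < 2:
        return 0.0
    return pearson([ref.get(k, 0.0) for k in keys],
                   [cand.get(k, 0.0) for k in keys])
```

A value close to 1 suggests that the candidate has an n-gram profile similar to the reference corpus; where to put the cut-off would have to be chosen experimentally.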


2 answers
YasonBy, 2011-09-11
@YasonBy

Your idea is perfectly viable. In fact, you have reinvented the N-gram method :)
If you want something more sophisticated (is it worth it?), here is an article with an overview of three methods (PDF, in English). A selection of articles on the topic can be found here. For a more popular treatment in Russian, see here.
The complexity of the approach you need depends on your task. For example, if accuracy matters, you can use Markov chains. That is, take “War and Peace” and collect statistics: how often does the letter x_N follow the sequence of letters x_1 ... x_(N-1)? N is about 3 or 4. Then take the text under test and run through it, multiplying the probabilities. The result is an estimate of how likely the text under test is to be meaningful Russian.
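
A minimal sketch of the Markov chain approach might look like this. Several things here are assumptions that the description above does not specify: the names `train` and `avg_log_prob` are hypothetical, logarithms are summed instead of multiplying raw probabilities (to avoid underflow on long texts), add-one smoothing is used so that unseen sequences do not give zero probability, and the score is averaged per position so that texts of different lengths can be compared.

```python
from collections import Counter
from math import log


def train(corpus, n=3):
    """Count character n-grams and their (n-1)-character contexts."""
    corpus = corpus.lower()
    positions = range(len(corpus) - n + 1)
    ngrams = Counter(corpus[i:i + n] for i in positions)
    contexts = Counter(corpus[i:i + n - 1] for i in positions)
    alphabet_size = len(set(corpus))
    return ngrams, contexts, alphabet_size, n


def avg_log_prob(model, text):
    """Average log-probability of each letter given the n-1 letters before it."""
    ngrams, contexts, alphabet_size, n = model
    text = text.lower()
    positions = len(text) - n + 1
    if positions <= 0:
        raise ValueError("text is shorter than the n-gram length")
    total = 0.0
    for i in range(positions):
        gram = text[i:i + n]
        # Assumption: add-one (Laplace) smoothing over the corpus alphabet.
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + alphabet_size)
        total += log(p)
    return total / positions
```

The score is always negative; the closer it is to zero, the more the text resembles the training corpus. A decision threshold would have to be tuned on real examples of meaningful and random text.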
If speed matters more than accuracy, the probabilities can be replaced with Boolean values: does the sequence of letters x_1 ... x_N occur at least once in “War and Peace”?
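
The Boolean variant could be sketched like this. The function names and the idea of reporting the fraction of known sequences are assumptions made for illustration; the original suggestion only says to check whether each sequence has been seen at least once.

```python
def seen_ngrams(corpus, n=3):
    """Set of all character n-grams that occur at least once in the corpus."""
    corpus = corpus.lower()
    return {corpus[i:i + n] for i in range(len(corpus) - n + 1)}


def known_fraction(seen, text, n=3):
    """Fraction of the text's n-grams that were seen in the corpus."""
    text = text.lower()
    positions = len(text) - n + 1
    if positions <= 0:
        return 0.0
    hits = sum(text[i:i + n] in seen for i in range(positions))
    return hits / positions
```

With a large corpus, meaningful text should score close to 1.0 and random characters close to 0.0. The set lookup is cheap, which is the point of trading accuracy for speed.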

rPman, 2011-09-11
@rPman

A random set of characters can be detected with dictionaries (with a little extra support for endings, prefixes, and so on). As far as I remember, open-source spell-checking libraries already contain the necessary algorithms and databases.
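
As a rough sketch of the dictionary approach: the word set is supplied by the caller, and the suffix list below is a hypothetical, English-only stand-in for the much richer ending and prefix rules that a real spell-checking library ships with.

```python
import re

# Hypothetical suffix list, for illustration only; real spell-checker
# dictionaries come with far more complete affix rules.
SUFFIXES = ("ing", "ed", "es", "s")


def is_known(word, dictionary):
    """True if the word, or the word minus a known suffix, is in the dictionary."""
    if word in dictionary:
        return True
    return any(word.endswith(s) and word[:-len(s)] in dictionary
               for s in SUFFIXES)


def dictionary_hit_rate(text, dictionary):
    """Fraction of words in the text that the dictionary recognizes."""
    # Runs of letters only; digits, underscores and punctuation split words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    return sum(is_known(w, dictionary) for w in words) / len(words)
```

A hit rate near 1.0 suggests real words and a rate near 0.0 suggests random characters. It says nothing about whether the words form a meaningful sentence.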
Identifying simple sentences in a human language can be done with statistical analysis, but that is an order of magnitude harder and still does not solve the problem, since you cannot single out meaningful text this way...
Giants like ABBYY take billions in grants to develop such algorithms; I'm afraid your chances of developing a successful one are not much better.
P.S. Try to decide whether this text makes sense. It is a classic (taken from here, also here):

Born on Herzen Street. In grocery store No. 22. Well-known economist. Librarian by calling. The people are collective farmers. The store is a salesperson. In the economy, so to speak, it is necessary. This is, so to speak, a system ... uh ... consisting of 120 units. Take pictures of the Murmansk Peninsula and get te-le-fun-ken. And the accountant works on a different line. Through the Library. Because there will be no air, but there will be an academician! Well, here you can take a picture of the Murmansk Peninsula. You can become an air ace. You can become an air planet. And you will be sure that this planet will be accepted according to the textbook. This means that one planet will benefit physics. The value - torn off in the area of diplomacy - gives its fluctuations to all diplomacy. And Ilya Muromets gives hesitation only to his family. The match in the library works. He goes to the newsreel and lights a large sheet in the newsreel. In the library, a small sheet kindles. The fire will… uh… develop much more easily than a strong textbook. A strong textbook will be more weighty than a grocery store on Herzen Street. And on Herzen Street there will be a split textbook. Then the textbook will pass through Herzen Street, through grocery store No. 22, and be replaced there according to the formula of economic unity. Here in store 22, it can split, the economy! For economists, for dispatchers, for sellers, for the culture of trade ... So, the whole economy is moving in this direction. The library will move in the direction of 120 units, which will ... uh ... stack item on item. 120 units is the subject of physics. An electric bulb burns from 120 bricks, because its structure, so to speak, is similar to a brick. Ilya Muromets works at the Dynamo stadium. Ilya Muromets works at home. That's real diplomacy! "Open diplomacy" is the same. Well, we take a TV, insert it into the Murmansk Peninsula, wind it up, there ... uh ... black bread all the time ...
Well, will Muromets, or what, grow up? Ilya Muromets, perhaps, will grow out of this?

But black-hat SEOs generate far more interesting texts.
