How to distinguish a "word" from a meaningless set of characters?

Alexey Dabalaev, 2016-03-12 20:42:30

Algorithms

Hello.
Could you tell me what tools (independent of any programming language) can be used to determine that the entered characters form a word rather than a meaningless string of characters such as "vldatdukyvta call"?
Looking the input up in previously prepared dictionaries is not an acceptable solution.
One more question: can probability theory be applied here?

4 answer(s)

DarkMatter, 2016-03-12
@darkmatter

Bigram analysis, trigram analysis, and analysis of letter combinations that are impossible in the language.

Dimonchik, 2016-03-12
@dimonchik2013

In general, the problem is solved by looking the input up in previously prepared corpora,
but if you want algorithms:
1) letter frequencies in the language (works for large texts)
2) letter combinations in the language (works almost even for single words)
Naturally, both options lose to corpora.
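
A minimal Python sketch of option 1, under stated assumptions: the reference text below is a toy stand-in for a real corpus, and cosine similarity is just one possible way to compare frequency profiles (the answer does not name a measure). The names `letter_frequencies`, `cosine_similarity` and `REFERENCE_TEXT` are hypothetical.

```python
import math
from collections import Counter


def letter_frequencies(text):
    """Return the relative frequency of each letter in the text."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return {}
    counts = Counter(letters)
    return {ch: n / len(letters) for ch, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles (0..1)."""
    dot = sum(value * q.get(ch, 0.0) for ch, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


# Assumption: a toy reference text; a real check would use a large corpus.
REFERENCE_TEXT = (
    "the quick brown fox jumps over the lazy dog "
    "and then the dog sleeps in the warm sun"
)
reference = letter_frequencies(REFERENCE_TEXT)

natural = cosine_similarity(
    letter_frequencies("the sun is warm and the dog sleeps"), reference
)
random_like = cosine_similarity(
    letter_frequencies("zqxj vkqz xjqw zzxq"), reference
)
print(natural > random_like)
```

As the answer notes, a profile built from a single short word is too noisy for this to be reliable; the comparison only becomes meaningful on longer inputs.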

xmoonlight, 2017-01-23
@xmoonlight

One idea is to check by weighing what the language allows, using N-grams:
1. The share of the word's N-grams that are impossible letter combinations in the language.
2. The share of the word's N-grams that are possible letter combinations in the language.
If item 1 is "<=33%" (why not 0: to tolerate errors in the word) and item 2 is ">=67%", it is a word.
Otherwise it is not.
(IMHO; these are my conjectures, I have not tested this myself yet.)
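
The conjecture above could be sketched roughly as follows. This is an illustration, not a tested method: the set of "possible" bigrams is collected from a tiny hypothetical sample list (`SAMPLE_WORDS`), and every bigram not seen there is treated as "impossible". Under that assumption the two shares always add up to 100%, so checking item 1 against 33% is enough.

```python
def ngrams(word, n=2):
    """Return the list of overlapping n-grams of a word."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_possible_ngrams(sample_words, n=2):
    """Collect every n-gram that occurs in the sample words."""
    possible = set()
    for w in sample_words:
        possible.update(ngrams(w, n))
    return possible


def looks_like_word(word, possible, n=2, max_impossible=0.33):
    """True if at most `max_impossible` of the n-grams are unseen."""
    grams = ngrams(word, n)
    if not grams:
        return False  # too short to judge
    impossible = sum(1 for g in grams if g not in possible)
    return impossible / len(grams) <= max_impossible


# Assumption: a toy sample; a real table would be built from a large corpus.
SAMPLE_WORDS = [
    "parachute", "language", "letter", "window", "table",
    "orange", "stream", "garden", "winter", "morning",
]
POSSIBLE = build_possible_ngrams(SAMPLE_WORDS)

print(looks_like_word("garage", POSSIBLE))        # not in the sample list
print(looks_like_word("vldatdukyvta", POSSIBLE))
```

Note that "garage" is not in the sample list; it is accepted only because each of its bigrams occurs in some other sample word, which is the point of using N-grams instead of a dictionary lookup.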

Mercury13, 2018-01-24
@Mercury13

The simplest approach that comes to mind.
Model the word as a Markov chain (this is exactly probability theory). Transition probabilities can be found in Zhelnikov's well-known book (and not only there, I'm sure).
Then compute the probability of traversing the Markov chain along exactly this path. If the probability is too small, the input is just a random set of characters.
You still cannot do entirely without previously prepared dictionaries: consider, for example, the words "Shymkent" or "parachute", which contain "impossible" letter combinations (their probability in the Markov chain will, of course, be zero).
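
A rough Python sketch of the Markov-chain check, with several assumptions of mine: transition counts are estimated from a toy word list rather than taken from a published table, `^` and `$` are hypothetical start/end markers, and the log-probability is averaged per transition so that long words are not penalized for length alone. The optional `alpha` parameter adds add-one style smoothing, which is not part of the answer but is one common way to keep unseen transitions from forcing the probability to zero.

```python
import math
from collections import defaultdict

START, END = "^", "$"  # hypothetical word-boundary markers


def train_transitions(sample_words):
    """Count letter-to-letter transitions, including word boundaries."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in sample_words:
        chars = [START] + list(w.lower()) + [END]
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
    return counts


def avg_log_prob(word, counts, alpha=0.0, alphabet_size=27):
    """Average log-probability per transition along the word's path.

    alpha=0 gives raw estimates (an unseen transition yields -inf);
    alpha>0 smooths them. alphabet_size=27 assumes 26 letters plus END.
    """
    chars = [START] + list(word.lower()) + [END]
    total = 0.0
    for a, b in zip(chars, chars[1:]):
        row = counts.get(a, {})
        num = row.get(b, 0) + alpha
        den = sum(row.values()) + alpha * alphabet_size
        if num == 0 or den == 0:
            return float("-inf")  # path has zero probability
        total += math.log(num / den)
    return total / (len(chars) - 1)


# Assumption: a toy training list; real tables need a large corpus.
WORDS = ["garden", "winter", "letter", "table",
         "stream", "orange", "window", "morning"]
counts = train_transitions(WORDS)

print(avg_log_prob("garden", counts))
print(avg_log_prob("vldatdukyvta", counts))
print(avg_log_prob("vldatdukyvta", counts, alpha=1.0))
```

Where to put the "too small" cutoff is left open here; it would have to be tuned on known words and known noise for the language in question.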