How to distinguish a "word" from a meaningless set of characters?
Hello.
Guys, please tell me what tools (independent of any programming language) can be used to determine that the entered characters form a word and not a meaningless string of characters such as "vldatdukyvta call".
Looking the input up in pre-built dictionaries is not an acceptable solution.
And one more question: is it possible to use probability theory here?
Bigram analysis, trigram analysis, analysis of letter combinations that are impossible in the language.
In general, the problem is solved by lookup in pre-built corpora,
but if you want algorithms:
1) letter frequencies in the language (works for large texts)
2) letter combinations in the language (works for almost anything, even single words)
Naturally, both options lose to corpora.
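Option 1 can be sketched roughly as follows. This is only an illustration: the `REFERENCE` string stands in for a real sample of the language, and the distance measure (total variation distance) is one arbitrary choice among many; neither comes from the answer above.

```python
from collections import Counter

def letter_freqs(text):
    """Relative frequency of each alphabetic character in text."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return {}
    counts = Counter(letters)
    return {c: n / len(letters) for c, n in counts.items()}

def freq_distance(sample, reference_freqs):
    """Total variation distance between letter distributions.

    0.0 means identical distributions, 1.0 means no letters in common.
    """
    sample_freqs = letter_freqs(sample)
    letters = set(sample_freqs) | set(reference_freqs)
    return 0.5 * sum(
        abs(sample_freqs.get(c, 0.0) - reference_freqs.get(c, 0.0))
        for c in letters
    )

# Hypothetical stand-in for a large reference text in the target language.
REFERENCE = (
    "the quick brown fox jumps over the lazy dog while the other dogs "
    "sleep in the sun and the cats watch them from the fence"
)
ref = letter_freqs(REFERENCE)

natural = freq_distance("the dogs and the cats sleep in the sun near the fence", ref)
gibberish = freq_distance("zxqj vkzq xjqz kvxz qzjx zzkq xvjq", ref)
print(natural < gibberish)
```

As the answer notes, this only becomes reliable on large texts: a single short word has too few letters for its frequency profile to mean much.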
One idea is to score the word by counting which of its N-grams are possible and impossible in the language:
1. The share of the word's N-grams that are impossible letter combinations in the language.
2. The share of the word's N-grams that are possible letter combinations in the language.
If item 1 is <= 33% (not 0, to tolerate errors in the word) and item 2 is >= 67%, it is a word.
Otherwise it is not.
(IMHO; these are my conjectures, I have not tested this myself yet.)
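A minimal sketch of this check, under a loud assumption: an N-gram counts as "possible" if it occurs in a small sample word list (`SAMPLE_WORDS`, hypothetical) and as "impossible" otherwise. A real table of impossible combinations would have to come from a much larger corpus. The function names and the choice of bigrams are also just for illustration.

```python
def ngrams(word, n=2):
    """All overlapping n-grams of a word, lowercased."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def build_possible(words, n=2):
    """Set of n-grams observed in the sample words (treated as 'possible')."""
    possible = set()
    for w in words:
        possible.update(ngrams(w, n))
    return possible

def looks_like_word(candidate, possible, n=2, min_possible=0.67):
    """True if at least min_possible of the candidate's n-grams are possible."""
    grams = ngrams(candidate, n)
    if not grams:
        return False  # too short to judge
    share = sum(g in possible for g in grams) / len(grams)
    return share >= min_possible

# Hypothetical sample; every bigram not seen here counts as impossible.
SAMPLE_WORDS = [
    "parachute", "letter", "language", "problem", "dictionary",
    "random", "character", "word", "meaning", "question",
]
possible = build_possible(SAMPLE_WORDS)

print(looks_like_word("charter", possible))
print(looks_like_word("vldatdukyvta", possible))
```

If every N-gram is classified as either possible or impossible, the two shares add up to 100%, so the "<= 33%" and ">= 67%" conditions collapse into the single `min_possible` threshold used here.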
The easiest way that comes to mind.
Represent the word as a Markov chain (this is exactly probability theory). Transition probabilities can be found in Zhelnikov's well-known book (and not only there, I'm sure).
Then compute the probability of traversing the Markov chain along exactly this path. If the probability is too small, it is just a random set of characters.
You cannot get by without pre-built dictionaries, though: for example, the words "Shymkent" or "parachute" contain, in their Russian spelling, "impossible" letter combinations (their probability in the Markov chain will, of course, be zero).
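A rough sketch of the Markov chain idea. The transition counts here are estimated from a tiny hypothetical word list (`SAMPLE_WORDS`) rather than taken from a published table, and the code adds two things the answer does not mention: add-one smoothing, so that unseen transitions get a small probability instead of exactly zero, and averaging of log-probabilities, so that long words are not penalised just for their length.

```python
import math
from collections import defaultdict

def train_transitions(words):
    """Count letter-to-letter transitions; '^' and '$' mark word start and end."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        w = "^" + w.lower() + "$"
        for a, b in zip(w, w[1:]):
            counts[a][b] += 1
    return counts

def avg_log_prob(word, counts, alphabet_size=27, alpha=1.0):
    """Average log-probability per transition along the word's path.

    alphabet_size (26 letters plus the end marker) and alpha are
    assumed smoothing parameters, not values from the answer.
    """
    w = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(w, w[1:]):
        row = counts.get(a, {})
        row_total = sum(row.values())
        p = (row.get(b, 0) + alpha) / (row_total + alpha * alphabet_size)
        total += math.log(p)
    return total / (len(w) - 1)

# Hypothetical training sample; a real model needs a large corpus.
SAMPLE_WORDS = [
    "parachute", "letter", "language", "problem", "dictionary",
    "random", "character", "word", "meaning", "question",
]
model = train_transitions(SAMPLE_WORDS)

word_score = avg_log_prob("character", model)
noise_score = avg_log_prob("vldatdukyvta", model)
print(word_score > noise_score)
```

Where to put the cut-off between "word" and "random characters" is left open here, as in the answer; it would have to be tuned on real data.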