How to calculate the probability that a particular substring occurs only once in the entire string?

M

Magneto9032020-11-26 09:00:05

Mathematics

Magneto903, 2020-11-26 09:00:05

In the alphabet I have 4 letters (let A, B, C, D)
For each letter there is a probability, let it be p1, p2, p3, p4.
Next I have a word, length N .
And it contains some substring of length M . I need to find the probability that this substring occurs exactly once in a string .

How can I calculate this probability?

Reply

Answer the question

In order to leave comments, you need to log in

2 answer(s)

W

Wataru, 2020-11-26
@Magneto903

Here we need to build a state machine that accepts strings that end in a given string. Look at this article , this automaton is painted there at the beginning (sections prefix function - KMP atom).
Only there all the transition probabilities are the same, but you have them set for each letter.
So you built an automaton. Now the task is to find the probability that a random path in this automaton of length n passes through the final state exactly once. To do this, calculate 2 probabilities: that the path from the beginning of length k will reach the final state once, and that the path from the final state of length nk will not return to it.
Both of these probabilities can be calculated by dynamic programming:
P1(i, k) is the probability that the path starting from state i (i < n) in k steps will reach state n for the first time. It's just the sum over all possible transitions:

P1(i, k) = sum_{c - все символы} P1(next(i, c), k-1) * p(c)

Base:

P1(m, 0) = 1
P1(m, k>0) = 0
P1(i < m, 0) = 0

Second probability: take k steps from state i and never enter the final state:

P2(i, k) = sum_{c - все символы, next(i,c) < m} P2(next(i, c), k-1) * p(c)

Base:
P1(i, 0) = 1
The answer to the problem is the sum over all possible lengths of the first part of the path:
sum_{k=m..n} P1(0, k) * P2(m, n-k)
This solution through dynamic programming will be O(n*m) in time and memory.
I don’t see a closed formula, as in the problem in my article. Maybe if the probabilities of all symbols were the same, then something could be shortened.

N

Nikolai Chuprik, 2020-11-26
@choupa

I can only consider the case when ALL letters in the substring are different. This means that substrings are guaranteed not to overlap. Those. the case of the word "RAKOKOKSHY" and the substring "KOK" is impossible.
Let Pw(x) = p1*p2*p3*p4*p5*...*pM is the probability that a substring of length M starts from position number x, where pi is the probability of each a specific letter in the substring (in its place). All Pw(x) are equal from x = 1 to x = N-M+1 (for large x, the substring simply does not fit to the end of the word in length). However, the conditional probability Pw(y) = 0 if the substring actually starts at position x, and the position y is less than M from x (the substrings do not overlap).
Let ^Pw(x) = (1 - Pw(x)) be the probability of NOT encountering a substring starting at position x.
Then the probability that the substring starts at position 1 and does not occur anywhere else:
Pw(1) * [ ^Pw(M+1) * ^Pw(M+2) * ... ^Pw(N-M+1) ] = Pw * (N-2M+2) * ^Pw
for 2nd position:
^Pw(1) * Pw(2) * [ (N-2M+1) * ^Pw ], which is the same as for the first position, just the ^Pw multiplier moved forward from the [ ] brackets.
similarly for i-th position:
^Pw * (i-1) * [ Pw(i) * (N-2M+3-i) * ^Pw ] = Pw * ^Pw * (N-2M+2)
Now sum this probability for all positions from 1 to N-M+1
P = Pw*^Pw*(N-2M+2)*(N-M+1) ,
but for the case N < 2M it is still simpler (twice a substring in a word just does not fit with all the desire):
P \u003d Pw * (N-M + 1)
If substrings can overlap, then complex conditional probabilities (correlations) arise there, and in general, muddle begins.