Is there a ready-made algorithm in python for finding frequently occurring sequences?

D

Dima Pryadkin2018-02-13 09:47:58

Python

Dima Pryadkin, 2018-02-13 09:47:58

I am looking for an algorithm in near-Python, which, by a set of words, will select the most frequently occurring parts of the maximum length, or in advance of a given length.
For example, here are the words (there are many more): BSNREORG, BSNWA010,
At the output I want something like this: BSN (occurs in 100% of cases), REORG (50%), WA010 (50%).

Reply

Answer the question

In order to leave comments, you need to log in

3 answer(s)

L

longclaps, 2018-02-13
@Bangybug

A word of length n contains (n+1)*n/2 non-empty substrings; on a text of m words, the upper estimate is m*(n+1)*n/2 different substrings.
The task is PN-complete (well, what you wrote is generally a hell of a thing, and not a task - what the hell is "the most common" and how does it compare with "maximum length"?), So, the problem is solved by exhaustive enumeration.
Will the muzzle crack?
ps here is some dirty code that looks for the maximum substrings that occur in at least a couple of words. Horror-horror.

from collections import defaultdict
from pprint import pprint

data = [w.strip() for w in "BSNREORG, BSNWA010".split(',')]
alphabet = set(c for w in data for c in w)

nxt = {}
for c in alphabet:
    tmp = [w for w in data if c in w]
    if len(tmp) > 1:
        nxt[c] = tmp
pprint(nxt)
while nxt:
    cur, nxt = nxt, defaultdict(set)
    for k, v in cur.items():
        for c in alphabet:
            for pattern in {k + c, c + k}:
                nxt[pattern].update(w for w in v if pattern in w)
    nxt = {k: v for k, v in nxt.items() if len(v) > 1}
pprint(dict(cur))

V

Vladimir, 2018-02-13
@vintello

I think you need to use a library that is designed for this - NLTK
there are a bunch of examples here , look and choose what suits you best

D

Dima Pryadkin, 2018-02-13
@Bangybug

You can't do without enumeration, but you can speed it up if necessary.
Your estimates are not entirely correct (you need to add 1 more to the first one), but this is not important.
It seemed to me that there is still an algorithm for splitting sequences of characters into parts. I do not need a clear algorithm, so I wrote the word "like".
I'll see what I can get from the histogram for sequences of two letters throughout the text.