N
N
Nick V2014-11-04 00:11:01
Python
Nick V, 2014-11-04 00:11:01

Working with anagrams, how can you optimize the script?

The script works according to the following algorithm:

  1. We enter the word
  2. From the entered word we make anagrams
  3. The resulting anagrams are compared with the words from the list
  4. If there is a match, then output to the console

There are 109581 words in the dictionary. The longest word is antidisestablishmentarianism in which the number of letters is 28.
At the moment, if the entered word has more than 8 letters, then the script processes the data for a very long time. If you enter a word with a large number of letters, an OutOfMemory exception is thrown.
Need:
  1. Get rid of OutOfMemory
  2. Speed ​​up the script

A function that returns a list of anagrams from a given word:
anagrams
def anagrams(s):
    n=len(s)
    if n == 1:
        return s
    sb=[]
    for i in range(n):
        sb.append(s[i])
        rval=anagrams(s[0:i]+s[i+1:n])
        if isinstance(rval,str):
            sb[i]=sb[i]+rval
        if isinstance(rval,list):
            c=sb[i]
            sb[i]=[c + rval[x] for x in range(len(rval))]
    if(n==2):
        return sb
    else:
        c=[sb[x][h] for x in range(len(sb)) for h in range(len(sb[x]))]
        return c


A function that compares two lists and returns matches:
find_word
def find_word(list,source):
    L = []
    for item in list:
        if item in source and item not in L:
            L.append(item)
    if L:
        print('Matches found', len(L))
        print(L)
    else:
        print('No matches found')


Python 3.4.2
Tell me, tell me, explain because your experience is not enough for this(

Answer the question

In order to leave comments, you need to log in

5 answer(s)
T
throughtheether, 2014-11-04
@half-life

A function that returns a list of anagrams from a given word
I recommend taking a closer look at the itertools module, in particular, the permutations functions. Sample code:
import itertools
def anagrams(word):
  for permutation in itertools.permutations(word):
    yield ''.join(permutation)

for word in anagrams('car'):
    print(word)
car
cra
acr
arc
rca
rac

In the case of repetition of letters in the word, anagrams will also contain duplicates:
>>> for word in anagrams('rar'):
  print word
rar
rra
arr
arr
rra
rar

A function that compares two lists and returns matches
If I understood correctly, you keep lists of anagrams and dictionary words in memory and look for (linear search) their intersection. This, in my opinion, is not very effective. I would do like this:
words=['car','arc','cat','map','toster']
wordset=set(words)
for word in anagrams('car'):
   if word in wordset:
        print ("word %s matched vocabulary" % word)

If this is not enough, then it will be possible, in my opinion, to think about using Bloom filters.
UPD : As I understand it, the main problem is the number of anagrams.
We enter the word
From the entered word we make anagrams
From the very beginning, for each word from the dictionary, you can remember its 'sorted' version:
words=['car','arc','cat','map','toster']
sortedwordset=set(''.join(sorted(w)) for w in words)
>>> sortedwordset
set(['acr', 'eorstt', 'amp', 'act'])
Then for each word entered, you can check whether it makes sense to make anagrams:
if ''.join(sorted(word)) in sortedwordset:
    #continue with anagrams

UPD2 : It is possible, in my opinion, to do this: for each word from the dictionary, its 'sorted' form is formed. This form will be the key of the dictionary, and the value will be a list of dictionary words that are anagrams of this form. Then, due to preliminary calculations, it will be possible to quickly search for dictionary anagrams:
def sorted_string(s):
  return ''.join(sorted(s))

words=['car','arc','cat','map','toster']
d={}
for word in words:
  sorted_word=sorted_string(word)
  if sorted_word in d:
    d[sorted_word].append(word)
  else:
      d[sorted_word]=[word]
>>> d
{'acr': ['car', 'arc'], 'eorstt': ['toster'], 'amp': ['map'], 'act': ['cat']}
>>> d.get(sorted_string('car'),[])
['car', 'arc']
>>> d.get(sorted_string('cat'),[])
['cat']
>>> d.get(sorted_string('perkele'),[])
[]

A
Andrey Dugin, 2015-02-17
@adugin

The calculation of anagrams (permutations) is not required here (paragraphs 2 and 3 are misleading).
Therefore, the speed of execution of the code below does not depend on the length of the word.
Based on 20,000 words, it's instantaneous even with 'antidisestablishmentarianism':

from collections import Counter
from itertools import ifilter

def criteria(dictword):
    return (
        wlen == len(dictword) and
        wset == set(dictword) and
        wcnt == Counter(dictword)
    )

while True:

    word = raw_input('\nEnter word: ')
    wlen, wset, wcnt = len(word), set(word), Counter(word)

    with open('thesaurus.txt') as f:
        thesaurus = (line.rstrip() for line in f)
        for dictword in ifilter(criteria, thesaurus):
            print dictword

    if word in {'exit', 'quit'}:
        break

S
Sergey Lerg, 2014-11-04
@Lerg

You don't have to store all the anagrams in one array: do a permutation, check in the word base, then the next permutation. This will get rid of the OutOfMemory problem. But it will add a little time, as there will be repetitions.
Then you can break the word base into 28 separate bases, where only words of a certain length are stored. Should speed up the search a bit.
But how to make it faster - I would use another programming language (Go, C, Java) and try to parallelize the process into threads.

A
Alexander Wolf, 2014-11-04
@mannaro

First, prepare your dictionary.
1) Create a new hash.
2) We sort through the entire dictionary.
2.1) We split the current word into letters, sort them, glue them together.
2.2) Check if there is such a key in the hash.
2.2.1) If not, then create hash[key] = [] (empty array)
2.2.2) If yes, then add the current word to the array.
3) Preparation is over. We save the current hash and use it everywhere.
4) Comparison function: we take a word, we break it by letters, we sort it, we stick it together.
5) We check the presence of this key in our dictionary. We output the result.
JS example: jsfiddle.net/6bz8g9gz Python
example:

dict = ['wolf','flow','hello','world','folw','jack','open','close','nepo','peno','kill'];

def prepare(d):
  hash = {};
  for v in d:
    v = v.lower()
    list = [c for c in v]
    list.sort()
    result = "".join(list)
    if hash.has_key(result):
      hash[result].append(v)
    else:
      hash[result] = [v]
  return hash

def test(word, d):
  word = word.lower()
  list = [c for c in word]
  list.sort()
  result = "".join(list)
  if d.has_key(result) and len(d[result]) != 0:
    return d[result]
  else:
    return False

d = prepare(dict)
print test("WolF", d)

N
Nick V, 2014-11-04
@half-life

Thanks to all. Solution found. Everything works quickly and as intended. Special thanks to throughtheether and Sergey Lerg
Sasha Thank you too, although your example did not start for me.

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question