M
M
mrgloom2014-07-18 09:26:41
Python
mrgloom, 2014-07-18 09:26:41

How to count the number of occurrences of lines in a file?

Let's say there is a text file where each line is a word, words can occur several times in the file, you need to count the number of occurrences of each word in the file.
The problem is that there can be several hundred million words, i.e. text file >4GB in size.
1.What is the best way to store data instead of a text file?
2.In fact, one could use std::map to calculate, but what if everything does not fit into memory? (i.e., for good, I would like std:: map which, as it were, lies on the disk)
In fact, the task looks like {id, name, surname}, which, as it were, hints at SQL, but I don’t really want to mess with it (i.e. extra binding and I didn’t work with it almost), and how many such records can it pull and how easy is it to count the number of occurrences using SQL?
also if someone can suggest a solution in python it would be even better.

Answer the question

In order to leave comments, you need to log in

6 answer(s)
M
mrstrictly, 2014-07-18
@mrgloom

"Several millions" for modern amounts of RAM is not a problem. It makes sense to clarify the restrictions imposed on your program. Have you tried to solve it "on the forehead"? Are you doing premature optimization? :)
If the volume of the input text is really not limited from above, then this task looks, for example, like one of the classic examples of map-reduce (I am by no means talking about frameworks, hadups and other zookeepers, but about the idea), which boils down to: split the input stream into N fragments of a fixed size (for example, one million lines each), count the number of words in each fragment independently (map step), obtaining N key-value sets at the output (where, key is a word, value is the number of occurrences) , then sum these sets (reduce step). If the number of keys at the output of map is again huge (which I can hardly imagine for "natural" languages), you can shard intermediate results when the map step outputs not one continuous file, but K fragments (for example, the first one is words in "ac", the second - to "df" etc.). Here's a little more on that:michaelnielsen.org/blog/write-your-first-mapreduce...

R
RPG, 2014-07-18
@RPG

C++11 + unordered_map. Not a meteor, but map consistently outperforms and enough to solve the problem.
In general, this is how it is solved in Bash:
If the structure is complex, then you first need to select the key from the file, for example like this:

$ cut -d: -f7 /etc/passwd | sort | uniq -c
      2 /bin/bash
      1 /bin/sync
      1 /sbin/halt
     34 /sbin/nologin
      1 /sbin/shutdown

E
Eugene, 2014-07-18
@EvgenijDv

std::map might be fine. A million records for a modern amount of RAM is not so much :-)
With SQL it will be very easy to count the number of occurrences, but IMHO this is an overhead for this task :-)

T
tsarevfs, 2014-07-18
@tsarevfs

You are clearly prematurely optimizing. The solution in python, absolutely head-on, works fast enough on a file with 10M words of 8 characters. By the way, this is only 100 megabytes. Even if there are 10 times more words, memory is enough.

import random, string
from collections import defaultdict as ddict

def randomword(length):
   return ''.join(random.choice(string.lowercase) for i in range(length))

def main():
  f = open('a.txt', 'w')
  for i in range(10000000):
    f.write(randomword(8) + '\n')

  f.close()
  print('gen finished')

  d = ddict(int)
  stat = ddict(int)
  f = open('a.txt', 'r')
  for w in f.readlines():
    d[w] += 1
    stat[d[w]] += 1
    if d[w] > 1:
      stat[d[w] - 1] -= 1

  print stat




if __name__ == '__main__':
  main()

B
Boniface, 2014-07-18
@Boniface

Good afternoon. This is a trivial task. Use regular expressions or ready-made functions to find the number of occurrences of a substring in a string. For example php substr_count.

D
DancingOnWater, 2014-07-18
@DancingOnWater

In the case of millions of map options - a bad decision - there will be too many collisions.

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question