linux
Leks, 2015-11-23 07:17:15

How to parse a large number of logs?

Good day.
What is the best way to approach parsing a large volume of CSV call logs (~12-15 GB) so as to get the maximum processing speed?
Each log record is a set of fields: "time, name, duration".
The result should be the total duration for each unique name.
The following Python script ran for more than 2 hours; I would like it to be faster:

import sys
import re

d = {}

for line in sys.stdin:
    NameRE = re.compile("NAME=(\w+)")
    TimeRE = re.compile("TIME=(\d+)")
    if NameRE.search(line):
        Name = str(NameRE.search(line).group(1))
        Time = int(TimeRE.search(line).group(1))
    if Name in d:
        Time += d[Name]
        d[Name] = Time
    else:
        d[Name] = Time

for k in d:
    print '%s  %s' % (k, d[k])


8 answer(s)
Kirill, 2015-11-23
@kshvakov

Put the data into a database.
For MySQL: LOAD DATA INFILE (dev.mysql.com/doc/refman/5.7/en/load-data.html)
For PostgreSQL: COPY (www.postgresql.org/docs/9.3/static/sql-copy.html)
Or you could use SQLite.
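A minimal sketch of the SQLite variant, assuming the log is plain CSV with columns in the order time,name,duration; the file names calls.csv / calls.db and the table layout are made up for the example:

import csv
import sqlite3

conn = sqlite3.connect("calls.db")
conn.execute("CREATE TABLE IF NOT EXISTS calls (time TEXT, name TEXT, duration INTEGER)")

# Bulk-load the CSV rows, then let the database do the aggregation.
with open("calls.csv") as f:
    conn.executemany("INSERT INTO calls VALUES (?, ?, ?)", csv.reader(f))
conn.commit()

for name, total in conn.execute("SELECT name, SUM(duration) FROM calls GROUP BY name"):
    print '%s  %s' % (name, total)

For 12-15 GB, the LOAD DATA INFILE / COPY route mentioned above is likely to load the data faster than inserting row by row from Python.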

neol, 2015-11-23
@neol

- re.compile needs to be moved out of the loop.
- combine the two regular expressions into one and replace the three search calls with a single one.
- remove the useless str(NameRE.search(line).group(1)) conversion (see the sketch below).
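Roughly, with those three changes applied (a sketch; the combined pattern assumes NAME always appears before TIME on a line):

import sys
import re

# Compiled once, outside the loop; one regex and a single search per line.
LineRE = re.compile(r"NAME=(\w+).*?TIME=(\d+)")

d = {}
for line in sys.stdin:
    m = LineRE.search(line)
    if m:
        name = m.group(1)
        d[name] = d.get(name, 0) + int(m.group(2))

for k in d:
    print '%s  %s' % (k, d[k])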

protven, 2015-11-23
@protven

Look towards Apache Spark. I took a course on it over the summer,
https://courses.edx.org/courses/BerkeleyX/CS100.1x...
where one of the first labs was exactly this kind of task: parsing Apache logs.
First of all, Spark is much friendlier than Hadoop and easier to set up.
Secondly, because it can keep all the data in memory, it will most likely be faster, provided you allocate machines with enough RAM. In general, I would advise spending a couple of hours studying it: the course I linked comes with a ready-made Vagrantfile. Download Vagrant, then VirtualBox, run vagrant up, and you get a ready environment in which you can try to solve your problem.
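For reference, a rough PySpark sketch of the same aggregation; the file name calls.csv and the column order time,name,duration are assumptions, and the parsing step would need adjusting if the lines are KEY=value pairs as in your script:

from pyspark import SparkContext

sc = SparkContext(appName="CallDurations")

# Sum the duration column per name across the cluster.
totals = (sc.textFile("calls.csv")
            .map(lambda line: line.split(","))
            .map(lambda fields: (fields[1], int(fields[2])))
            .reduceByKey(lambda a, b: a + b))

for name, total in totals.collect():
    print '%s  %s' % (name, total)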

Stalker_RED, 2015-11-23
@Stalker_RED

habrahabr.ru/company/dca/blog/267107

Swartalf, 2015-11-23
@Swartalf

A bit off topic, but if you're on Linux and need to iterate over large CSVs, wouldn't it be easier to use awk?
The speedup will be very, very noticeable.
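For example, something along these lines (assuming plain CSV columns in the order time,name,duration and a hypothetical file name calls.csv):

awk -F, '{ sum[$2] += $3 } END { for (n in sum) print n, sum[n] }' calls.csv

A single pass like this keeps everything in one awk process and avoids Python's per-line overhead.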

beduin01, 2015-11-23
@beduin01

Read this article, it has examples: tech.adroll.com/blog/data/2014/11/17/d-is-for-data...

Roman Mirilaczvili, 2015-11-23
@2ord

If the question is about the code, then in addition to neol's points I can add that d[Name] = Time is, for some reason, present in both the if and else branches, which is not optimal.
If the question is about choosing the right tool, then as an alternative to your own script you could also try the aforementioned Apache Spark.

Sergey Wegner, 2015-11-23
@wegners

Good afternoon. If there is no confidential data in the logs, could you post a sample of them for testing? And what result do you want to get?
