Python
FeRViD, 2014-03-20 17:25:25

Why is Python many times inferior to Perl in terms of speed and memory consumption when parsing logs?

I got curious to compare Python and Perl for speed when working with large files. For the test I wrote two small scripts; each one reads the log and builds a hash in which the key is the IP (the first field in the log line) and the value is all the other request data from that IP.
An nginx log was used, in which the fields are separated by :%%:.
The log file size is 1 GB.

#!/usr/bin/perl -w

open(F, "</var/logs/access.log");

while (<F>) {
    # first field is the ip, the rest of the line is the request data
    ($ip, $d) = split(/:%%:/, $_, 2);

    if (!(exists $host{$ip})) {
        $host{$ip} = {};
        $host{$ip}{data} = '';
    }

    $host{$ip}{data} .= $d;
}

time ./speed_test.pl
real 0m14.713s
user 0m7.240s
sys 0m2.060s
#!/usr/bin/python3.2

fd = open('/var/logs/access.log', 'r')

host = {}

for line in fd:
    # first field is the ip, the rest of the line is the request data
    ip, d = line.split(':%%:', 1)

    if ip not in host:
        host[ip] = {}
        #host[ip]['data'] = []
        host[ip]['data'] = ''

    #host[ip]['data'].append(d)
    host[ip]['data'] = host[ip]['data'] + d

time ./speed_test.py
Traceback (most recent call last):
  File "./speed_test.py", line 16, in <module>
    host[ip]['data'] = host[ip]['data'] + d
MemoryError
real 6m35.528s
user 3m13.940s
sys 3m19.828s
Memory overflowed at the 6th minute...
If you use a list to store the subsequent lines (uncomment the commented-out lines), the memory overflow occurs much earlier:
time ./speed_test.py
Traceback (most recent call last):
  File "./speed_test.py", line 7, in <module>
    for line in fd:
MemoryError
real 0m25.717s
user 0m7.016s
sys 0m2.168s
I can't figure out why Python is showing such disgusting results...

6 answers
throughtheether, 2014-03-20

I'm not much of a Python expert, but my guess is that the main resources are being wasted here:

host[ip]['data'] = ''
...
host[ip]['data'] = host[ip]['data'] + d

Why are you using a dictionary as the value for the keys of the outer dictionary (the host variable)? Try doing this instead:
host[ip] = ''
...
host[ip] = host[ip] + d
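
A minimal sketch of the whole loop with that change applied (assuming the same log path and :%%: separator as in the question; it keeps the string concatenation and only drops the inner dictionary):

#!/usr/bin/python3.2

# Sketch: flat dict mapping ip -> accumulated request data
host = {}

with open('/var/logs/access.log', 'r') as fd:
    for line in fd:
        ip, d = line.split(':%%:', 1)
        if ip not in host:
            host[ip] = ''
        host[ip] = host[ip] + d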

Vladimir Ulupov, 2014-03-20

You can try this with a list:

from collections import defaultdict
host = defaultdict(list)
for line in open('1.24gb.log'):
    ip, d = line.split(' ', 1)
    host[ip].append(d)

With a 1.24 GB log:
on Python 2.7.6 - 6.7 sec
on Python 3.4.0 - 9.7 sec
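
If the concatenated string per IP is still needed afterwards, one possible follow-up (my addition, not part of the answer above) is to join each list once after the loop; a single join per key avoids the repeated copying that per-line concatenation causes:

# Build the final per-ip string in one pass over the collected pieces
result = {ip: ''.join(parts) for ip, parts in host.items()}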

FeRViD, 2014-03-20

As I understand it, in Python the operation of appending onto an ever longer string is very expensive, so expensive that it has to be abandoned. And it is still not clear why the memory overflows in the end, while in Perl everything is fine, around 20% usage (with 4 GB of RAM).
But using a list doesn't solve the problem either.

FeRViD, 2014-03-21

#!/usr/bin/python3.2
from collections import deque, defaultdict

host = defaultdict(deque)
with open('/var/logs/access.log', 'r') as f:
    for line in f:
        ip, d = line.split(r':%%:', 1)
        host[ip].append(d)

time ./speed_test.py
Traceback (most recent call last):
  File "./speed_test.py", line 6, in <module>
    for line in f:
MemoryError
real 0m17.845s
user 0m7.096s
sys 0m2.440s

snowpiercer, 2014-08-12

Strings in Python are immutable, so every concatenation creates a new string and copies the data into it.
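
A small illustration of that cost, using synthetic data rather than the questioner's log: when the growing string is held in a dict (as in the question), each concatenation copies everything accumulated so far, so the total work grows roughly quadratically, while collecting the pieces in a list and joining once is linear:

import timeit

def concat(parts):
    # Mirrors the question's pattern: the growing string lives in a dict,
    # so every iteration builds and copies a brand-new, longer string.
    d = {'data': ''}
    for p in parts:
        d['data'] = d['data'] + p
    return d['data']

def join(parts):
    # Collect references and copy the data only once at the end.
    return ''.join(parts)

parts = ['x' * 100] * 10000
print('concat:', timeit.timeit(lambda: concat(parts), number=1))
print('join:  ', timeit.timeit(lambda: join(parts), number=1))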

Ptktysq, 2014-12-17

In the Perl script, only the first and last lines of the loop body are needed; the rest is slow garbage.
