linux
Johnny Mastricht, 2015-09-18 17:51:05

How to properly parse access.log with Python?

Hello. I need to generate a table showing, for each timestamp, how many requests returned each class of response code. For example:

Time, 200, 30x, 40x, 50x
10/Aug/2015:12:12:01, 7, 2, 0, 1

That is, at the given time there were 7 requests answered with "200", 2 answered with a 30x code such as "302" or "304", and so on.
A line in access.log looks like this:
8.8.8.8 - - [10/Aug/2015:12:12:01 +0000] "GET /robots.txt HTTP/1.1" 200 708 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
I wrote the following Python code:
#!/usr/bin/python
import re
for string in open("access.log","r"):
 ws = string.split()
 i=0
 r=0
 n=0
 x=0
 matchi = re.findall(r'^2..',ws[8])
 if matchi:
  i=i+1
 matchr = re.findall(r'^3..',ws[8])
 if matchr:
  r=r+1
 matchn = re.findall(r'^4..',ws[8])
 if matchn:
  n=n+1
 matchx = re.findall(r'^5..',ws[8])
 if matchx:
  x=x+1
 print ws[3].replace('[',''), i, r, n, x

The output I get is:
10/Aug/2015:12:12:01 1 0 0 0
10/Aug/2015:12:12:01 0 1 0 0
10/Aug/2015:12:12:01 0 1 0 0
10/Aug/2015:12:12:01 0 1 0 0

Question: how do I now sum these values and print each unique time only once (which Python tools or algorithm should I use), so that for this example the output is:
10/Aug/2015:12:12:01 1 3 0 0


2 answer(s)
Andrey Dugin, 2015-09-19
@JohnyMastricht

from collections import Counter, defaultdict
from itertools import imap
import re

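codes = ['200', '3xx', '4xx', '5xx']
# Capture the timestamp (without the timezone) and the status code.
regex = re.compile(r'^.+?\[(?P<date>.+?) .+?\].+?(?P<code>\d+) \d+ ".+?" ".+?"$')
# Maps each timestamp to a Counter of response classes.
stats = defaultdict(Counter)

with open('access.log', 'r') as f:
    # Lines that do not match the pattern are skipped.
    for date, code in (match.groups() for match in imap(regex.match, f) if match):
        # '200' is counted as is; any other code is grouped by its first digit.
        stats[date].update([code if code == '200' else '{}xx'.format(code[0])])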

for date, items in sorted(stats.iteritems()):
    print date, ' '.join(str(items[code]) for code in codes)

# ---------- Another variant ----------

from collections import Counter, defaultdict
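from itertools import ifilter, imap
from operator import methodcaller as mc
import re

codes = ['200', '3xx', '4xx', '5xx']
regex = re.compile(r'^.+?\[(?P<date>.+?) .+?\].+?(?P<code>\d+) \d+ ".+?" ".+?"$')
stats = defaultdict(Counter)

def fmt(code):
    # The trailing comma returns a 1-tuple, because Counter.update() expects an iterable.
    return code if code == '200' else '%sxx' % code[0],

with open('access.log', 'r') as f:
    # reduce() only drives the iteration; its result is discarded.
    # ifilter(None, ...) drops the None produced by lines that do not match.
    reduce(
        lambda _, (date, code): stats[date].update(fmt(code)),
        imap(mc('groups'), ifilter(None, imap(regex.match, f))), None
    )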

for date, items in sorted(stats.iteritems()):
    print date, ' '.join(imap(str, imap(items.__getitem__, codes)))
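
The code in this thread is Python 2 (imap, iteritems, the print statement). For anyone on Python 3, here is a minimal sketch of the same Counter/defaultdict idea. It is not a drop-in replacement for the answer above: the names LINE_RE, classify, count_by_time and format_rows are made up for illustration, the regular expression assumes the log layout shown in the question (and additionally tolerates "-" in the size field), and only the first sample line comes from the question, the other three are invented so that the result matches the desired output.

```python
import re
from collections import Counter, defaultdict

# Column order of the table requested in the question.
CLASSES = ['200', '3xx', '4xx', '5xx']

# Assumed layout: ... [date zone] "request" status size ...
# The size field is allowed to be "-" (an assumption, not from the thread).
LINE_RE = re.compile(
    r'\[(?P<date>\S+) [^\]]+\] "[^"]*" (?P<code>\d{3}) (?:\d+|-)'
)


def classify(code):
    """Keep '200' as is; group any other code by its first digit, e.g. '3xx'."""
    return code if code == '200' else code[0] + 'xx'


def count_by_time(lines):
    """Return {timestamp: Counter({response class: hits})}."""
    stats = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line)
        if match:  # lines that do not fit the assumed layout are skipped
            stats[match.group('date')][classify(match.group('code'))] += 1
    return stats


def format_rows(stats):
    """Yield one 'time n200 n3xx n4xx n5xx' string per timestamp."""
    for date in sorted(stats):  # textual sort, not chronological
        yield ' '.join([date] + [str(stats[date][c]) for c in CLASSES])


# Illustrative input: first line from the question, the rest invented.
sample = [
    '8.8.8.8 - - [10/Aug/2015:12:12:01 +0000] "GET /robots.txt HTTP/1.1" 200 708 "-" "UA"',
    '8.8.8.8 - - [10/Aug/2015:12:12:01 +0000] "GET /a HTTP/1.1" 302 0 "-" "UA"',
    '8.8.8.8 - - [10/Aug/2015:12:12:01 +0000] "GET /b HTTP/1.1" 304 - "-" "UA"',
    '8.8.8.8 - - [10/Aug/2015:12:12:01 +0000] "GET /c HTTP/1.1" 301 5 "-" "UA"',
]

for row in format_rows(count_by_time(sample)):
    print(row)  # prints: 10/Aug/2015:12:12:01 1 3 0 0
```

Since count_by_time only iterates over lines, an open file object (for example the result of open('access.log')) can be passed instead of the list. As in the answer above, codes such as 201 or 204 end up in a '2xx' bucket that is not printed, and the rows are sorted as plain strings, so a log spanning several days or months may need the timestamps parsed before sorting.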


sim3x, 2015-09-18
@sim3x

It is better to use an existing tool, such as:
https://github.com/etsy/logster
https://github.com/lebinh/ngxtop
https://github.com/lethain/apache-log-parser
