How to parse a file line by line, catching and overwriting my pattern?

Night Tarlis, 2015-12-22 06:35:19

Python

import re

f1 = open("/home/tarlis/ParserTest/2.txt", 'r')
f2 = open("/home/tarlis/ParserTest/1.txt", "a")

fr = f1.read()
reg_pattern = 'title=\"(\D+)\"\D*data=\"([a-z.][email protected][mailstbknox]+\.ru)'
for line in fr:
    match = re.search(reg_pattern, line)
    if match is not None:
        f2.write(match.group(1) + '|' + match.group(2) + '\n')
f2.close()
f1.close()

match is always None, even though I tested the regular expression on regex101.com and it works fine there. As far as I can tell, the file is being read line by line... I don't understand what the problem is :(
The file being read contains something like this:

<div>

    <a  target="_blank"   " title="Дмитрий" data="[email protected]">Дмитрий </a>
</div>

1 answer(s)

Alexander, 2015-12-22
@tarlis

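f1.read() reads the whole file at once into a single string, so the subsequent loop iterates over that string one character at a time and the regular expression never sees a full line.
A file object is itself iterable line by line, so you can do something like this: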

for line in f1:
    match = re.search(reg_pattern, line)
    ...

Or read lines via f1.readlines() into a list and then iterate over it.
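
To make the difference concrete, here is a small sketch. It uses io.StringIO as a stand-in for the opened file, and both the sample text and the pattern are simplified, hypothetical placeholders rather than the ones from the question.

```python
import io
import re

# Hypothetical sample text standing in for the contents of the input file.
sample = 'first line title="abc"\nsecond line\n'

# Simplified, hypothetical pattern (not the one from the question).
pattern = r'title="(\w+)"'

# Iterating over the string returned by read() yields single characters.
chars = [piece for piece in io.StringIO(sample).read()]

# Iterating over the file-like object itself yields whole lines.
lines = [piece for piece in io.StringIO(sample)]

# A one-character string can never match the pattern; a full line can.
char_hits = [c for c in chars if re.search(pattern, c)]
line_hits = [m.group(1) for m in (re.search(pattern, l) for l in lines) if m]

print(chars[:5])
print(lines)
print(char_hits, line_hits)
```

The character-based loop finds nothing, which is the same symptom as in the question, while the line-based loop finds the title.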
I would also recommend parsing HTML with a specialized library, such as lxml or pyquery, rather than with regular expressions.
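
As a rough illustration of the parser-based approach, the sketch below uses html.parser from the standard library instead of lxml or pyquery, only so that it runs without extra installs. The class name LinkAttrParser and the address user@example.ru are made up for the example, and the sample markup is a tidied version of the snippet from the question.

```python
from html.parser import HTMLParser


class LinkAttrParser(HTMLParser):
    """Collects (title, data) attribute pairs from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)  # attrs arrives as a list of (name, value) tuples
        if "title" in attr_map and "data" in attr_map:
            self.pairs.append((attr_map["title"], attr_map["data"]))


# Sample markup with a placeholder address (an assumption, not real data).
html = (
    '<div>\n'
    '    <a target="_blank" title="Дмитрий" data="user@example.ru">Дмитрий </a>\n'
    '</div>\n'
)

parser = LinkAttrParser()
parser.feed(html)
parser.close()

# Same "title|data" output format as in the question's script.
records = ["{}|{}".format(title, data) for title, data in parser.pairs]
print(records)
```

With a parser, the attribute order and the spacing inside the tag no longer matter, which is where a hand-written regular expression tends to break.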