How to extract data on a specific delimiter?

Python

ksvdon, 2015-05-27 00:31:36

I originally solved this problem in bash with cut and sed, but that version runs for about an hour because it has to handle bulky files, so I decided to rewrite it in Python.
The text to process might look something like this:

some words 4234 digits letters anything symbols - +
and a table:
|text   |452 | digits | spaces    |

What I have written so far looks at the first character of each line. If the first character is "|", the line is a table row and needs to be processed. Ideally, I want to get a list of lists (each table row is a list inside a general "table" list) whose elements contain no spaces and no empty entries. If a cell between two separators is empty, it should simply be skipped; if a word or number does not fill the whole space between the separators, the leftover gap should be stripped.

In practice I get something like a list of lists, but I still have to somehow specify that the list elements should be "any characters between the separators, except for spaces and except for nothing". I can't work out how to express that myself. I'm not familiar with Python regexps, and in general I don't know much about Python yet.

#!/usr/bin/python

import sys
import re

towrl = sys.argv[1]

# Read the whole input file into a list of lines.
destlip = open(towrl, "r")
dodestlip = destlip.readlines()
destlip.close()

# Compile both patterns once instead of on every iteration.
row_start = re.compile(r'^\|')
separator = re.compile(r'[|]')

respa = []
for line in dodestlip:
    # Only lines that start with "|" are table rows.
    if row_start.match(line):
        # The cells still contain padding spaces and empty strings here.
        respa.append(separator.split(line))
print(respa)

What are the options?

1 answer(s)

Stanislav Fateev, 2015-05-27
@ksvdon

More examples of your input and of the expected output would help.
I think re is unnecessary here. Try it like this:

#coding: utf-8

row = u'|text   |452 | digits |     |'
# Split on "|", trim each cell, and drop cells that are empty after trimming.
cells = [cell.strip() for cell in row.split('|') if cell.strip()]
print(cells)

Result:
['text', '452', 'digits']
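
The same idea can be applied to every line of a file to build the list of lists described in the question. This is only a sketch: it assumes a table row is any line whose first character is "|", and the function name parse_table and the sample lines are made up for illustration.

```python
def parse_table(lines):
    """Return a list of rows; each row is a list of non-empty, stripped cells."""
    table = []
    for line in lines:
        # Assumption: a table row is any line whose first character is "|".
        if not line.startswith('|'):
            continue
        # Split on the delimiter, trim whitespace, and drop empty cells.
        cells = [cell.strip() for cell in line.split('|') if cell.strip()]
        table.append(cells)
    return table


# Hypothetical sample input modelled on the text in the question.
sample = [
    'some words 4234 digits letters - +\n',
    'and a table:\n',
    '|text   |452 | digits | spaces    |\n',
    '|more text |  | 17 |\n',
]
table = parse_table(sample)
print(table)
```

An open file object can be passed instead of the sample list, since iterating over a file yields its lines one at a time, which also avoids loading a bulky file into memory at once. Note that spaces inside a cell (as in "more text") are kept; only the padding around the cell is removed.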