Python
Mikhail Faito, 2017-02-14 09:04:25

How to parse large text in Python?

Good afternoon. Here is the input data:
a text file (txt) is given, in which the data is laid out like this (columns are aligned with spaces):

Иванов Иванов    (rus)                            ООО "Белое и пушистое"
Ivanov Ivan           (en)                             White and Fluffy LLC
                                                                 Москва, Кремль, офис №15

There is a huge number of such records (the document is about 50 megabytes in total).
The task: the utility takes "White and fluffy" or "White and Fluffy LLC" and "rus" or "eng" as input, and responds with the first and last name in the requested language.
The problem is the size of the source text document.
What would you recommend for reasonably fast parsing of such a file?
An important note, as it turned out: we do not edit this file ourselves. It is sent by another (government) organization, and its format cannot be changed.


4 answer(s)
Anton fon Faust, 2017-02-14
@bubandos

It is not clear why this has to be done in Python. Is it a test assignment?
Collect the data into a dictionary whose keys are "White and fluffy" and "White and Fluffy LLC"; each value is another dictionary with the keys "rus" and "eng", holding "Ivanov Ivanov" and "Ivanov Ivan" respectively.
In general, though, it is better to write a parser that loads the data into a database once, and a client that connects to the database and selects the required data for each request. That will work quickly and reliably.
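The nested-dictionary idea could look roughly like the sketch below. It is only an illustration under several assumptions that the question does not confirm: every data line has the shape "name (lang) company" with columns separated by runs of spaces, an address line or blank line ends a record, and each company has a single person (otherwise the inner values would have to be lists). The names LINE_RE and build_index are made up for this example.

```python
import re
from collections import defaultdict

# Assumed line layout: "<name>   (<lang>)   <company>", columns padded with spaces.
LINE_RE = re.compile(r"^(\S.*?)\s+\((\w+)\)\s+(\S.*?)\s*$")


def build_index(lines):
    """Map each company name (lower-cased) to a {lang: person name} dictionary."""
    index = defaultdict(dict)
    names = {}       # lang -> person name for the record being read
    companies = []   # every company spelling seen in the record being read

    def flush():
        # Each company spelling of a record points at all names of that record.
        for company in companies:
            index[company.lower()].update(names)
        names.clear()
        companies.clear()

    for line in lines:
        match = LINE_RE.match(line)
        if match:
            name, lang, company = match.groups()
            names[lang] = name
            companies.append(company)
        else:
            # An address line (starts with spaces) or a blank line closes the record.
            flush()
    flush()  # the last record may not be followed by a terminating line
    return index
```

With a real file the call would be something like build_index(open(path, encoding=...)), which streams the file line by line. Note that the lookup is an exact, case-insensitive match on the full company name as written in the file, and the language keys are whatever appears in the parentheses (the sample shows "en", not "eng"); matching a shortened name such as "White and fluffy" would need a scan over the keys.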

iSergios, 2017-02-14
@iSergios

However, frogs. However, the cactus
The solution is so simple that posting finished code feels almost indecent. Look into the .readline() method.
In general, to avoid pain later and to make life easier for people, I would take the advice about the database if I were you. Load the data into a database once (sqlite will do) and write a simple GUI on top of it for adding, deleting, and searching. Half a day of work.
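As a rough sketch of the sqlite suggestion, the code below uses Python's standard sqlite3 module. The table name entries, its columns, and the functions load and find_person are invented for this example, and it assumes a parser has already turned the file into (record_id, lang, person, company) tuples, where both language lines of one record share a record_id.

```python
import sqlite3


def load(conn, rows):
    """Store parsed rows; each row is (record_id, lang, person, company)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries "
        "(record_id INTEGER, lang TEXT, person TEXT, company TEXT)"
    )
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def find_person(conn, company_part, lang):
    """Return the person name in `lang` for a record whose company matches."""
    row = conn.execute(
        """
        SELECT p.person
        FROM entries AS c
        JOIN entries AS p ON p.record_id = c.record_id
        WHERE c.company LIKE ? AND p.lang = ?
        LIMIT 1
        """,
        # LIKE ignores case for ASCII letters only; Cyrillic must match exactly.
        (f"%{company_part}%", lang),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")  # use a file path to keep the data between runs
load(conn, [
    (1, "rus", "Иванов Иванов", 'ООО "Белое и пушистое"'),
    (1, "en", "Ivanov Ivan", "White and Fluffy LLC"),
])
```

The self-join lets a search by either company spelling return the name in either language, and the substring match means "White and fluffy" finds "White and Fluffy LLC". For a 50 MB source, an index on the company column or a full-text table may be worth considering, since a LIKE pattern with a leading wildcard scans the whole table.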

Xander017, 2017-02-14
@Xander017

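For example, for the data as it currently appears in the question:
inp = input()
# Iterating over the file reads it line by line, so the 50 MB are never held in memory at once.
with open("yourfile", "r") as f:
    for line in f:
        if inp in line:
            fio = line.split()  # split on any run of spaces
            print(fio[0] + " " + fio[1])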

abcd0x00, 2017-02-15
@abcd0x00

What would you recommend for more or less fast parsing of such a file?

You have to move this into a database that supports SQL queries. Today you need to look up one thing by one key; tomorrow you will need something entirely different by a different key. A flexible query language is exactly what suits all of those possible cases.
So you need to write a migration of the source data into a proper database, and then hook it up to a script that checks whether the source file has changed, so that a new database is built automatically. To write the migration, you first prepare the source file (remove the extra spaces), then break it into separate records (this is the lexical analyzer), and then save that stream of tokens as rows in a database table. The database itself has to be designed sensibly, so that any search can be performed and nothing gets mixed up. You may even need separate tables (for Russian and for English) that are linked to each other. A lot of work.
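The "check whether the source file has changed" step could be sketched as below. This is one possible approach rather than anything the answer prescribes: it hashes the file with SHA-256 and keeps the last seen digest in a small stamp file. The names file_digest and rebuild_if_changed, the stamp file, and the rebuild callback (which would run the actual migration) are all hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so a large file is never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rebuild_if_changed(source, stamp, rebuild):
    """Call rebuild(source) only when the source differs from the last seen digest.

    `stamp` is a small text file that remembers the digest of the last build.
    Returns True if a rebuild was triggered, False otherwise.
    """
    source, stamp = Path(source), Path(stamp)
    current = file_digest(source)
    if stamp.exists() and stamp.read_text() == current:
        return False  # same content as last time, keep the existing database
    rebuild(source)
    stamp.write_text(current)  # written only after a successful rebuild
    return True
```

Comparing modification times would be cheaper, but a content hash also catches the case where the other organization resends an identical file with a new timestamp, and hashing 50 MB is quick.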
And you volunteered to make their work easier? You shouldn't have; they won't appreciate it anyway. You will end up working for free, for a certificate and a pat on the back.
If you can, stay out of such things altogether: let them search by hand, or with their feet for all it matters. Tell them that this is how it has to be, and spend your time on something useful so that you don't degrade.
