C++ / C#
CityzenUNDEAD, 2022-03-22 13:22:22

How to process a 10 GB text file?

Good day!
For two days now I have been looking for a way to process a huge text file.
The task is as follows:
I receive a huge XML text file, about 10 GB in size.
The file has the following structure:

<organization typeof="Organization" about="http://opendata.trudvsem.ru/7710538364-organizations/organizations.xml#315910200403678">
    <region rel="dc:references" resource="http://opendata.trudvsem.ru/7710538364-regions/regions.xml#9100000000000"/>
    <name property="name">АЛИМЕНКО ДМИТРИЙ НИКОЛАЕВИЧ</name>
    <creationDate>2022-03-05</creationDate>
    <legalName>АЛИМЕНКО ДМИТРИЙ НИКОЛАЕВИЧ</legalName>
    <companyStructureHidden>false</companyStructureHidden>
    <ogrn>315910200403678</ogrn>
    <inn>910504080415</inn>
    <addressCode>9100000000000</addressCode>
    <firstRateCompany>Не относится к крупнейшим компаниям</firstRateCompany>
    <businessSize>SMALL</businessSize>
    <source>EMPLOYMENT_SERVICE</source>
    <innerInfo>
      <codeExternalSystem>CZN</codeExternalSystem>
      <dateModify>2022-03-13</dateModify>
      <deleted>false</deleted>
      <isModerated>true</isModerated>
      <moderationTime>2022-03-13</moderationTime>
      <registrationStatus>Получена по интеграции</registrationStatus>
      <status>Одобрено</status>
      <disableImportInfo>false</disableImportInfo>
      <disableImportVacancy>false</disableImportVacancy>
      <disableJoinCompany>false</disableJoinCompany>
      <disableJoinManager>false</disableJoinManager>
    </innerInfo>
  </organization>
<organization>
...
</organization>

In other words, the file lists a huge number of organizations.
I need to process this file so that each organization ends up in a separate file,
that is, split the huge file into a large number of small files.
The difficulty is that the file cannot be loaded and processed as a whole.
It has to be read in pieces: read a piece, write it to a file, read the next piece.

Question: is it possible to read the file only up to the first occurrence of </organization>, write that piece of data to a file, and then continue reading from where I stopped?
There may be other ways to solve the problem, but so far this is the only one I have come up with: keep only small pieces of data that are currently being processed in memory. I just don't know how to implement it.


1 answer(s)
rPman, 2022-03-22
@CityzenUNDEAD

There are two options.

The correct but more complex one: use a streaming XML parser (google "stream xml parser c#" and see the first result).

The simple and crude one: if the organization tag is just one element of a huge list and the file is formatted so that each organization sits on its own line, you can pull out each organization quickly by substring search or by reading the file line by line, and then parse each piece with a familiar non-streaming parser. If the file is not formatted that way, the line breaks can be added first by other streaming means, for example with a sed regex that inserts a newline after every closing organization tag, or in your own program.
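
The substring-search variant can be sketched in C++ roughly as follows. This is only an illustration under several assumptions, not a replacement for a real streaming XML parser: organization elements are not nested, the closing tag is written exactly as </organization>, and the tag text never appears inside comments or CDATA. The names split_records and split_to_files, the 1 MiB chunk size, and the output file naming scheme are all made up for this example. Memory use is limited to one chunk plus the record currently being assembled.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <functional>
#include <iomanip>
#include <istream>
#include <sstream>
#include <string>

// Reads `in` in fixed-size chunks and calls `on_record` for every span that
// starts at `open_tag` (e.g. "<organization") and ends with `close_tag`.
// Returns the number of records found. Plain substring search, no XML parsing.
std::size_t split_records(std::istream& in,
                          const std::string& open_tag,
                          const std::string& close_tag,
                          const std::function<void(const std::string&)>& on_record,
                          std::size_t chunk_size = 1 << 20) {
    assert(chunk_size > 0 && !open_tag.empty() && !close_tag.empty());
    std::string buf;                      // unprocessed data carried between reads
    std::string chunk(chunk_size, '\0');  // reusable read buffer
    std::size_t count = 0;

    while (in) {
        in.read(&chunk[0], static_cast<std::streamsize>(chunk.size()));
        buf.append(chunk, 0, static_cast<std::size_t>(in.gcount()));

        std::size_t pos = 0;  // everything before pos is already handled
        for (;;) {
            std::size_t start = buf.find(open_tag, pos);
            if (start == std::string::npos) {
                // Keep only a short tail that may hold a tag split across reads.
                std::size_t keep = open_tag.size() - 1;
                std::size_t tail = buf.size() > keep ? buf.size() - keep : 0;
                if (tail > pos) pos = tail;
                break;
            }
            std::size_t after = start + open_tag.size();
            if (after >= buf.size()) { pos = start; break; }  // need more data
            char c = buf[after];
            if (c != '>' && c != ' ' && c != '\t' && c != '\r' && c != '\n') {
                pos = start + 1;  // a longer name such as <organizations>: skip
                continue;
            }
            std::size_t end = buf.find(close_tag, after);
            if (end == std::string::npos) { pos = start; break; }  // incomplete
            end += close_tag.size();
            on_record(buf.substr(start, end - start));
            ++count;
            pos = end;
        }
        buf.erase(0, pos);  // drop what has been handled
    }
    return count;
}

// Writes every organization to its own file: <out_prefix>000000001.xml, ...
// Returns the number of files written (0 if the input cannot be opened).
std::size_t split_to_files(const std::string& input_path,
                           const std::string& out_prefix) {
    std::ifstream in(input_path, std::ios::binary);
    if (!in) return 0;
    std::size_t index = 0;
    return split_records(in, "<organization", "</organization>",
        [&](const std::string& record) {
            std::ostringstream name;
            name << out_prefix << std::setw(9) << std::setfill('0')
                 << ++index << ".xml";
            std::ofstream out(name.str(), std::ios::binary);
            out << record << '\n';
        });
}
```

A call such as split_to_files("organizations.xml", "out/org_") would then produce one file per organization, assuming the out directory already exists. Bytes are copied unchanged, so the Cyrillic text keeps its original encoding, but each output file is a bare fragment without an XML declaration. With millions of records it may also be worth spreading the files over several subdirectories.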
