Parsing
beduin01, 2022-03-29 21:18:38

How to parse data from docx?

Suppose I have a set of documents with a very loose structure.
Let's say these are complaints from some department. They share a common outline, but the structure and the number of sections vary. Some sections may be incomplete, and some may be missing entirely.

I need to extract parts of sections from these documents and roughly group the data in them.

I have questions about the approach. In theory, I think you can unpack the docx, find the necessary sections in the target XML, cut them out, and then feed them to some kind of neural network.
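The unpacking step described above can be sketched with the Python standard library alone. This is a minimal sketch, assuming the main document part sits at `word/document.xml` (the usual location) and that paragraph styles such as `Heading1` are worth keeping as a hint about structure. The function name `read_paragraphs` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def read_paragraphs(docx_file):
    """Return a list of (style_id, text) tuples, one per non-empty paragraph.

    docx_file may be a path or a binary file-like object.
    """
    # A .docx file is a ZIP archive; the body text lives in one XML part
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # Paragraph style id (e.g. "Heading1"), if the author applied one
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W_NS}}}val") if style is not None else None
        # Text is split across runs; w:t elements hold the actual characters
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append((style_id, text))
    return paragraphs
```

Paragraphs inside tables are included as well, because the loop walks the whole tree. Documents with text boxes or tracked changes would likely need extra handling that this sketch does not attempt.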

Or can this step be skipped, so that the document is parsed directly? Is ML needed here, or can parsing rules be written without it?

If ML is needed, how can training be automated so that the model learns which parts to cut out?


2 answer(s)
Vasily Bannikov, 2022-03-29
@vabka

It seems to me that as long as it is possible to do without ML, it is better to do without it.
The docx format is admittedly not the simplest: it describes the appearance of the document rather than its formal structure, but that is still something to work with.
At a minimum, you can pull out the bare text without formatting and continue working with that.
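As an illustration of working with bare text and no ML, here is a minimal rule-based sketch that groups lines under section headings found by regular expressions. The section names and patterns in `SECTION_PATTERNS` are hypothetical examples, not taken from any real document set, and the function name `group_by_section` is an assumption as well.

```python
import re

# Hypothetical heading patterns; replace with the headings your documents use
SECTION_PATTERNS = {
    "applicant": re.compile(r"^\s*applicant\b", re.IGNORECASE),
    "description": re.compile(r"^\s*description of (the )?complaint\b", re.IGNORECASE),
    "request": re.compile(r"^\s*(request|demands?)\b", re.IGNORECASE),
}


def group_by_section(lines):
    """Assign each line to the most recently seen section heading.

    Lines before the first recognised heading go under "preamble".
    Heading lines themselves are consumed and not stored.
    Sections that never occur are simply absent from the result.
    """
    sections = {}
    current = "preamble"
    for line in lines:
        matched = next(
            (name for name, pattern in SECTION_PATTERNS.items() if pattern.match(line)),
            None,
        )
        if matched is not None:
            # A heading starts a new section
            current = matched
            sections.setdefault(current, [])
            continue
        if line.strip():
            sections.setdefault(current, []).append(line.strip())
    return sections
```

Because missing sections just do not show up as keys, documents with a varying number of sections can go through the same rules. Any body line that happens to start with a heading keyword would be misread as a heading, so the patterns need tuning against real samples.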

Dmtm, 2022-03-31
@Dmtm

Use text formatting, keywords, and their order as signals; you can try a neural network.
There are two different tasks here, so two neural networks:
1) section boundary recognition
2) section type recognition
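The signals named above (formatting, keywords, order) could be turned into per-paragraph features before any model is trained. The following is only a sketch under assumptions: the feature set, the `KEYWORDS` list, and the names `ParagraphFeatures` and `extract_features` are all hypothetical, and the answer does not say which model should consume them.

```python
from dataclasses import dataclass

# Hypothetical keyword list; choose words that signal sections in your documents
KEYWORDS = ("applicant", "complaint", "request", "attachments")


@dataclass
class ParagraphFeatures:
    position: float        # relative position in the document, 0.0 to 1.0
    word_count: int
    is_bold: bool          # formatting flag supplied by the caller
    ends_with_colon: bool
    keyword_hits: int


def extract_features(paragraphs):
    """Turn (text, is_bold) pairs into one feature record per paragraph."""
    total = len(paragraphs)
    features = []
    for index, (text, is_bold) in enumerate(paragraphs):
        lowered = text.lower()
        features.append(
            ParagraphFeatures(
                # Order matters: the same wording means different things early or late
                position=index / (total - 1) if total > 1 else 0.0,
                word_count=len(text.split()),
                is_bold=is_bold,
                ends_with_colon=text.rstrip().endswith(":"),
                keyword_hits=sum(lowered.count(k) for k in KEYWORDS),
            )
        )
    return features
```

The same records could feed both tasks: a boundary model would predict whether a paragraph starts a section, and a type model would label the section that follows.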
